
1. NLP Understanding
2. Business Understanding
3. Dataset Preparation
4. Data Extraction and DataFrame Preparation
5. Data Understanding
6. Feature Engineering
7. Exploratory Data Analysis (EDA)
8. Text Pre-Processing
9. EDA (After Text Pre-Processing)
10. Extracting vectors from text (Vectorization)
11. Data Splitting – Train Test Split
12. Building ML Models for Text Classification
13. Prediction with All ML Models Using Different Vectorization Techniques
14. Deep Neural Network
15. Model Evaluation: Classification Report and Log Loss
16. Finalizing Model Choice
17. Deployment
18. Conclusion & Recommendation
Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data.
NLP is currently the focus of significant interest in the machine learning community. Common use cases include sentiment analysis, spam detection, machine translation, chatbots, and document classification.
A lot of the data that you could be analyzing is unstructured data and contains human-readable text. Before you can analyze that data programmatically, you first need to preprocess it. In this tutorial, you’ll take your first look at the kinds of text preprocessing tasks you can do with NLTK so that you’ll be ready to apply them in future projects. You’ll also see how to do some basic text analysis and create visualizations.
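The kind of preprocessing pipeline NLTK automates — lowercasing, stripping non-letters, tokenizing, and removing stop words — can be sketched in plain Python. This is an illustrative mini version (the tiny stop-word list here is hypothetical; NLTK's stopwords corpus is far larger):

```python
import re

# A tiny illustrative stop-word list; NLTK's stopwords corpus is much larger.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for"}

def simple_preprocess(text):
    """Lowercase, strip non-letters, tokenize on whitespace, drop stop words."""
    text = re.sub(r"[^a-zA-Z]", " ", text.lower())
    return [tok for tok in text.split() if tok not in STOPWORDS]

tokens = simple_preprocess("The candidate is skilled in SQL and PeopleSoft administration.")
print(tokens)  # ['candidate', 'skilled', 'sql', 'peoplesoft', 'administration']
```

NLTK adds linguistically informed versions of each step (sentence-aware tokenization, lemmatization, POS tagging), which is why this project downloads the `punkt`, `stopwords`, and `wordnet` resources below.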
Text classification is one of the important tasks in supervised machine learning (ML): the process of assigning tags or categories to documents, which lets us structure and analyze text automatically, quickly, and cost-effectively. It is one of the fundamental tasks in Natural Language Processing, with broad applications such as sentiment analysis, spam detection, topic labeling, and intent detection.
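As a minimal sketch of what a text classifier does, here is a toy bag-of-words nearest-neighbor rule over cosine similarity. The documents and labels are hypothetical examples, not the project's resume data, and this stands in for (but is much simpler than) the vectorizers and models built later:

```python
from collections import Counter
import math

# Toy labeled documents (hypothetical examples, not the project's resume data).
train = [
    ("peoplesoft admin hrms payroll", "peoplesoft"),
    ("peoplesoft dba upgrade hrms", "peoplesoft"),
    ("sql queries stored procedures ssis", "sql"),
    ("sql server developer ssrs", "sql"),
]

def vectorize(text):
    # Bag-of-words: term -> count
    return Counter(text.split())

def cosine(a, b):
    common = set(a) & set(b)
    num = sum(a[t] * b[t] for t in common)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def predict(text):
    vec = vectorize(text)
    # Label of the most similar training document (1-NN over cosine similarity).
    best = max(train, key=lambda pair: cosine(vec, vectorize(pair[0])))
    return best[1]

print(predict("experienced peoplesoft hrms administrator"))  # peoplesoft
print(predict("writing sql stored procedures"))              # sql
```

The real pipeline below replaces raw counts with TF-IDF or embeddings and the 1-NN rule with trained models (logistic regression, SVM, XGBoost, neural networks), but the shape of the problem is the same: text in, category out.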
We will build a classifier for predicting the person skills based on the description in the resume.
Company Problem: To classify resume to reduce manual human effort in the HRM and financial department.
objective: The document classification solution should significantly reduce the manual human effort in the HRM and financial department. It should achieve a higher level of accuracy and automation with minimal human intervention
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO
from os.path import splitext
import os
import re, string
import pandas as pd
import numpy as np
import docxpy
from tika import parser
import warnings
import seaborn as sns
warnings.filterwarnings('ignore')
# To read .doc files and convert them to .docx
from glob import glob
import win32com.client as win32
from win32com.client import constants
#for text pre-processing
from sklearn.preprocessing import LabelEncoder
import nltk
nltk.download('omw-1.4')
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
from wordcloud import WordCloud
from textblob import TextBlob, Word
from nltk.stem import PorterStemmer
nltk.download('stopwords')
stop=set(stopwords.words('english'))
from collections import Counter
from nltk.util import ngrams
#for model-building
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.decomposition import TruncatedSVD
from sklearn.svm import SVC
import xgboost as xgb
from sklearn.model_selection import GridSearchCV
from keras.utils import np_utils
from keras.models import Sequential
from keras.layers.core import Dense, Activation, Dropout
from keras.preprocessing import text, sequence
from keras import layers, models, optimizers
from tensorflow.keras import optimizers
from tqdm import tqdm
from keras.layers.embeddings import Embedding
from keras.callbacks import EarlyStopping
import tensorflow as tf
from tensorflow.keras.layers import LSTM, Dense, Input, Dropout
from tensorflow.keras.layers import SpatialDropout1D
from tensorflow.keras.optimizers import Adam
from keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import BatchNormalization
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import LeaveOneOut
#for model Accuracy
from sklearn.metrics import classification_report, f1_score, accuracy_score, confusion_matrix, mean_absolute_error, mean_squared_error
from sklearn import metrics
from sklearn import preprocessing, decomposition, model_selection, metrics, pipeline, ensemble
from sklearn.model_selection import cross_val_score
from numpy import mean
from numpy import absolute
from numpy import sqrt
# bag of words
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
#for word embedding
import gensim
from gensim.models import Word2Vec
#for NER
import spacy
from spacy import displacy
nlp = spacy.load("en_core_web_sm")
#for visualization
import matplotlib
import matplotlib.pyplot as plt
import cufflinks as cf
cf.go_offline()
cf.set_config_file(offline=False, world_readable=True)
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.offline as offline
offline.init_notebook_mode()
from plotly import tools
import plotly.tools as tls
import plotly.express as px
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis
pyLDAvis.enable_notebook()
# Converting .doc files to .docx
def save_as_docx(path):
    # Open MS Word via COM automation
    word = win32.gencache.EnsureDispatch('Word.Application')
    doc = word.Documents.Open(path)
    doc.Activate()
    # Rename path with .docx
    new_file_abs = os.path.abspath(path)
    new_file_abs = re.sub(r'\.\w+$', '.docx', new_file_abs)
    # Save and close
    word.ActiveDocument.SaveAs(
        new_file_abs, FileFormat=constants.wdFormatXMLDocument
    )
    doc.Close(False)

# Split a filename into name and extension, handling multi-dot names
def splitext_(path):
    if len(path.split('.')) > 2:
        return path.split('.')[0], '.'.join(path.split('.')[-2:])
    return splitext(path)

# Keep only alphabetic characters; everything else becomes a space
def text_preprocess(text):
    cleaned_text = re.sub(r"[^a-zA-Z]", ' ', text)
    return cleaned_text

# Extracting text from a .pdf file
def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()
    try:
        for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,
                                      caching=caching, check_extractable=True):
            interpreter.process_page(page)
    except Exception:
        print("This pdf won't allow text extraction!")
    fp.close()
    device.close()
    extracted_text = retstr.getvalue()  # avoid shadowing the built-in str
    retstr.close()
    return extracted_text
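A quick sanity check of the two pure-text helpers above, with illustrative inputs (the sample strings are hypothetical; the helpers are restated here so the snippet is self-contained):

```python
import re
from os.path import splitext

# Keep only alphabetic characters; everything else becomes a space
def text_preprocess(text):
    return re.sub(r"[^a-zA-Z]", ' ', text)

# Split a filename into name and extension, handling multi-dot names
def splitext_(path):
    if len(path.split('.')) > 2:
        return path.split('.')[0], '.'.join(path.split('.')[-2:])
    return splitext(path)

# Digits and punctuation all become spaces, which is why the extracted
# resumes below contain only letters and runs of whitespace.
print(text_preprocess("John Doe, PeopleSoft DBA (2015)!"))

print(splitext_("resume.backup.pdf"))  # ('resume', 'backup.pdf')
print(splitext_("resume.docx"))        # ('resume', '.docx')
```

Note that `splitext_` exists because some resume files have multi-dot names, where `os.path.splitext` alone would report only the last component as the extension.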
# Peoplesoft resume .doc to .docx conversion
# Create list of paths to .doc files
paths = glob('C:\\Users\\rahul\\Project 114\\Peoplesoft resumes\\*.doc', recursive=True)
for path in paths:
    save_as_docx(path)

# PeopleSoft resumes: extracting text and creating a .csv file
extracted = []
# Extract text based on the file extension
for foldername, subfolders, files in os.walk(r"C:/Users/rahul/Project 114/Peoplesoft resumes"):
    for file_ in files:
        dict_ = {}
        file_name, extension = splitext_(file_)
        if extension == '.pdf':
            converted = convert_pdf_to_txt(foldername + "/" + file_)
            converted = text_preprocess(converted)
            dict_['Extracted'] = converted
            dict_['Label'] = foldername.split('/')[-1]
            extracted.append(dict_)
        elif extension == '.docx':
            doc = docxpy.process(foldername + '/' + file_)
            doc = text_preprocess(doc)
            dict_['Extracted'] = doc
            dict_['Label'] = foldername.split('/')[-1]
            extracted.append(dict_)
        elif extension == '.ppt':
            parsed = parser.from_file(foldername + '/' + file_)
            ppt = parsed["content"]
            ppt = text_preprocess(ppt)
            dict_['Extracted'] = ppt
            dict_['Label'] = foldername.split('/')[-1]
            extracted.append(dict_)
df = pd.DataFrame(extracted)
print(df)
df.to_csv('Peoplesoft_resumes.csv')
Extracted Label
0 Anubhav Kumar Singh Core Competencies ... Peoplesoft resumes
1 G Ananda Rayudu https www linked... Peoplesoft resumes
2 PeopleSoft Database Administrator ... Peoplesoft resumes
3 Classification Internal Classification Inte... Peoplesoft resumes
4 Priyanka Ramadoss MountPleasant C... Peoplesoft resumes
5 SIRAZUDDIN M Bangalore INDIA SIRAZUDD... Peoplesoft resumes
6 PEOPLESOFT Administrator SRINIVAS K ... Peoplesoft resumes
7 PeopleSoft Admin VARKALA VIKAS Career Obje... Peoplesoft resumes
8 Vinod Akkala ... Peoplesoft resumes
9 PeopleSoft Admin PeopleSoft DBA Ganesh Alla... Peoplesoft resumes
10 PeopleSoft Administration Vivekanand Sayan... Peoplesoft resumes
11 Arun Venu EXPERIENCE SUMMARY Exper... Peoplesoft resumes
12 Personal Details Name Pritam Biswas Dat... Peoplesoft resumes
13 Rahul Ahuja ... Peoplesoft resumes
14 Hari Narayana ... Peoplesoft resumes
15 Murali PROFESSIONA... Peoplesoft resumes
16 Priyabrata Hota CAREER OBJECTIVE Pur... Peoplesoft resumes
17 R Ahmed ... Peoplesoft resumes
18 Tanna Sujatha OBJECTIVE Seeking a cha... Peoplesoft resumes
19 C O N T A C T Address Manyata Tech Park ... Peoplesoft resumes
# SQL Developer Lightning insight resume .doc to .docx conversion
# Create list of paths to .doc files
paths = glob('C:\\Users\\rahul\\Project 114\\SQL Developer Lightning insight\\*.doc', recursive=True)
for path in paths:
    save_as_docx(path)

# SQL Developer Lightning insight resumes: extracting text and creating a .csv file
extracted2 = []
# Extract text based on the file extension
for foldername, subfolders, files in os.walk(r"C:/Users/rahul/Project 114/SQL Developer Lightning insight"):
    for file_ in files:
        dict_ = {}
        file_name, extension = splitext_(file_)
        if extension == '.pdf':
            converted = convert_pdf_to_txt(foldername + "/" + file_)
            converted = text_preprocess(converted)
            dict_['Extracted'] = converted
            dict_['Label'] = foldername.split('/')[-1]
            extracted2.append(dict_)
        elif extension == '.docx':
            doc = docxpy.process(foldername + '/' + file_)
            doc = text_preprocess(doc)
            dict_['Extracted'] = doc
            dict_['Label'] = foldername.split('/')[-1]
            extracted2.append(dict_)
        elif extension == '.ppt':
            parsed = parser.from_file(foldername + '/' + file_)
            ppt = parsed["content"]
            ppt = text_preprocess(ppt)
            dict_['Extracted'] = ppt
            dict_['Label'] = foldername.split('/')[-1]
            extracted2.append(dict_)
df = pd.DataFrame(extracted2)
print(df)
df.to_csv('SQL_Developer.csv')
Extracted \
0 ANIL KUMAR MADDUKURI SQL MSBI Developer...
1 Aradhana Tripathi Current Location Gachibo...
2 BUDDHA VAMSI ...
3 KAMBALLA PRADEEP ...
4 Hyderabad Nazeer Basha SQL and Power BI D...
5 Resume Name Neeraj Mishra Experienc...
6 SQL DEVELOPER Name Bandi prem sai Car...
7 SQL SERVER DEVELOPER Priyanka L ...
8 SQL SERVER DEVELOPER P Syam Kumar ...
9 RAJU PAVANA KUMARI Professional Summary...
10 resume Ramalakshmi K ...
Label
0 SQL Developer Lightning insight
1 SQL Developer Lightning insight
2 SQL Developer Lightning insight
3 SQL Developer Lightning insight
4 SQL Developer Lightning insight
5 SQL Developer Lightning insight
6 SQL Developer Lightning insight
7 SQL Developer Lightning insight
8 SQL Developer Lightning insight
9 SQL Developer Lightning insight
10 SQL Developer Lightning insight
Extracted \
0 ANIL KUMAR MADDUKURI SQL MSBI Developer...
1 Aradhana Tripathi Current Location Gachibo...
2 BUDDHA VAMSI ...
3 KAMBALLA PRADEEP ...
4 Hyderabad Nazeer Basha SQL and Power BI D...
5 Resume Name Neeraj Mishra Experienc...
6 SQL DEVELOPER Name Bandi prem sai Car...
7 SQL SERVER DEVELOPER Priyanka L ...
8 SQL SERVER DEVELOPER P Syam Kumar ...
9 RAJU PAVANA KUMARI Professional Summary...
10 resume Ramalakshmi K ...
11 Name Ramesh Career Objective ...
Label
0 SQL Developer Lightning insight
1 SQL Developer Lightning insight
2 SQL Developer Lightning insight
3 SQL Developer Lightning insight
4 SQL Developer Lightning insight
5 SQL Developer Lightning insight
6 SQL Developer Lightning insight
7 SQL Developer Lightning insight
8 SQL Developer Lightning insight
9 SQL Developer Lightning insight
10 SQL Developer Lightning insight
11 SQL Developer Lightning insight
Extracted \
0 ANIL KUMAR MADDUKURI SQL MSBI Developer...
1 Aradhana Tripathi Current Location Gachibo...
2 BUDDHA VAMSI ...
3 KAMBALLA PRADEEP ...
4 Hyderabad Nazeer Basha SQL and Power BI D...
5 Resume Name Neeraj Mishra Experienc...
6 SQL DEVELOPER Name Bandi prem sai Car...
7 SQL SERVER DEVELOPER Priyanka L ...
8 SQL SERVER DEVELOPER P Syam Kumar ...
9 RAJU PAVANA KUMARI Professional Summary...
10 resume Ramalakshmi K ...
11 Name Ramesh Career Objective ...
Label
0 SQL Developer Lightning insight
1 SQL Developer Lightning insight
2 SQL Developer Lightning insight
3 SQL Developer Lightning insight
4 SQL Developer Lightning insight
5 SQL Developer Lightning insight
6 SQL Developer Lightning insight
7 SQL Developer Lightning insight
8 SQL Developer Lightning insight
9 SQL Developer Lightning insight
10 SQL Developer Lightning insight
11 SQL Developer Lightning insight
Extracted \
0 ANIL KUMAR MADDUKURI SQL MSBI Developer...
1 Aradhana Tripathi Current Location Gachibo...
2 BUDDHA VAMSI ...
3 KAMBALLA PRADEEP ...
4 Hyderabad Nazeer Basha SQL and Power BI D...
5 Resume Name Neeraj Mishra Experienc...
6 SQL DEVELOPER Name Bandi prem sai Car...
7 SQL SERVER DEVELOPER Priyanka L ...
8 SQL SERVER DEVELOPER P Syam Kumar ...
9 RAJU PAVANA KUMARI Professional Summary...
10 resume Ramalakshmi K ...
11 Name Ramesh Career Objective ...
12 Tatikonda Kiran Kumar Career objecti...
Label
0 SQL Developer Lightning insight
1 SQL Developer Lightning insight
2 SQL Developer Lightning insight
3 SQL Developer Lightning insight
4 SQL Developer Lightning insight
5 SQL Developer Lightning insight
6 SQL Developer Lightning insight
7 SQL Developer Lightning insight
8 SQL Developer Lightning insight
9 SQL Developer Lightning insight
10 SQL Developer Lightning insight
11 SQL Developer Lightning insight
12 SQL Developer Lightning insight
                                            Extracted                            Label
0   ANIL KUMAR MADDUKURI SQL MSBI Developer...       SQL Developer Lightning insight
1   Aradhana Tripathi Current Location Gachibo...    SQL Developer Lightning insight
2   BUDDHA VAMSI ...                                 SQL Developer Lightning insight
3   KAMBALLA PRADEEP ...                             SQL Developer Lightning insight
4   Hyderabad Nazeer Basha SQL and Power BI D...     SQL Developer Lightning insight
5   Resume Name Neeraj Mishra Experienc...           SQL Developer Lightning insight
6   SQL DEVELOPER Name Bandi prem sai Car...         SQL Developer Lightning insight
7   SQL SERVER DEVELOPER Priyanka L ...              SQL Developer Lightning insight
8   SQL SERVER DEVELOPER P Syam Kumar ...            SQL Developer Lightning insight
9   RAJU PAVANA KUMARI Professional Summary...       SQL Developer Lightning insight
10  resume Ramalakshmi K ...                         SQL Developer Lightning insight
11  Name Ramesh Career Objective ...                 SQL Developer Lightning insight
12  Tatikonda Kiran Kumar Career objecti...          SQL Developer Lightning insight
13  SQL AND MSBI DEVELOPER SQL AND MSBI DEVELOPER... SQL Developer Lightning insight
# Workday resumes: .doc to .docx conversion
# Create a list of paths to .doc files
# (note: recursive=True only takes effect with a '**' pattern)
paths = glob('C:\\Users\\rahul\\Project 114\\workday resumes\\*.doc', recursive=True)
for path in paths:
    save_as_docx(path)
# Workday resumes: text extraction
extracted3 = []
# Extract text based on the file extension
for foldername, subfolders, files in os.walk(r"C:/Users/rahul/Project 114/workday resumes"):
    for file_ in files:
        dict_ = {}
        file_name, extension = splitext_(file_)
        if extension == '.pdf':
            converted = convert_pdf_to_txt(foldername + "/" + file_)
            converted = text_preprocess(converted)
            dict_['Extracted'] = converted
            dict_['Label'] = foldername.split('/')[-1]
            extracted3.append(dict_)
        elif extension == '.docx':
            doc = docxpy.process(foldername + '/' + file_)
            doc = text_preprocess(doc)
            dict_['Extracted'] = doc
            dict_['Label'] = foldername.split('/')[-1]
            extracted3.append(dict_)
        elif extension == '.ppt':
            parsed = parser.from_file(foldername + '/' + file_)
            ppt = parsed["content"]
            ppt = text_preprocess(ppt)
            dict_['Extracted'] = ppt
            dict_['Label'] = foldername.split('/')[-1]
            extracted3.append(dict_)
df = pd.DataFrame(extracted3)
print(df)
df.to_csv('workday_resumes.csv')
Extracted Label
0 Chinna Subbarayudu M DOB th March Na... workday resumes
1 Name Gopi Krishna Reddy ... workday resumes
2 Hari Krishna M Summary A result oriente... workday resumes
3 Harikrishna Akula ... workday resumes
4 HIMA MENDU Career Objective To contin... workday resumes
5 G Himaja ... workday resumes
6 JYOTI VERMA PROFESSIONAL SUMMARY PROF... workday resumes
7 Madeeswar A PROFILE SUMMARY Hav... workday resumes
8 Mooraboyina Guravaiah Workday Integration Spe... workday resumes
9 Name Naresh Babu Cherukuri Objective To... workday resumes
10 VENKATA SAIKRISHNA Workday Consultant P... workday resumes
11 Punugoti Swetha Workday Technical Consultant ... workday resumes
12 Workday HCM Techno functional Consultant ... workday resumes
13 Ramesh A Workday HCM Cons... workday resumes
14 Shireesh Balasani ... workday resumes
15 Workday Integration Consultant Name ... workday resumes
16 Srikanth WORKDAY hCM Consultant ... workday resumes
17 WORKDAY HCM FCM Name Kumar S S Role ... workday resumes
18 Venkateswarlu B Workday Consultant ... workday resumes
19 Vinay kumar v Workday Functional Consultant ... workday resumes
# React JS resumes: .doc to .docx conversion
# Create a list of paths to .doc files
paths = glob('C:\\Users\\rahul\\Project 114\\React JS\\*.doc', recursive=True)
for path in paths:
    save_as_docx(path)
# React JS resumes: text extraction
extracted4 = []
# Extract text based on the file extension
for foldername, subfolders, files in os.walk(r"C:/Users/rahul/Project 114/React JS"):
    for file_ in files:
        dict_ = {}
        file_name, extension = splitext_(file_)
        if extension == '.pdf':
            converted = convert_pdf_to_txt(foldername + "/" + file_)
            converted = text_preprocess(converted)
            dict_['Extracted'] = converted
            dict_['Label'] = foldername.split('/')[-1]
            extracted4.append(dict_)
        elif extension == '.docx':
            doc = docxpy.process(foldername + '/' + file_)
            doc = text_preprocess(doc)
            dict_['Extracted'] = doc
            dict_['Label'] = foldername.split('/')[-1]
            extracted4.append(dict_)
        elif extension == '.ppt':
            parsed = parser.from_file(foldername + '/' + file_)
            ppt = parsed["content"]
            ppt = text_preprocess(ppt)
            dict_['Extracted'] = ppt
            dict_['Label'] = foldername.split('/')[-1]
            extracted4.append(dict_)
df = pd.DataFrame(extracted4)
print(df)
df.to_csv('React_JS.csv')
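The per-folder extraction loops above differ only in the root folder and the output list. As a sketch (not the project's code), and assuming helpers such as `convert_pdf_to_txt`, `docxpy.process`, and `text_preprocess` behave as used above, the logic can be folded into one reusable function that maps extensions to extractors:

```python
import os

def extract_folder(root, extractors, preprocess):
    """Walk `root` and extract text from every file whose extension has a
    registered extractor. Returns a list of dicts with the preprocessed
    text under 'Extracted' and the containing folder name under 'Label'."""
    records = []
    for foldername, _subfolders, files in os.walk(root):
        # Folder name doubles as the class label, as in the loops above
        label = foldername.replace('\\', '/').rstrip('/').split('/')[-1]
        for file_ in sorted(files):
            _name, extension = os.path.splitext(file_)
            extractor = extractors.get(extension.lower())
            if extractor is None:
                continue  # skip unsupported file types
            raw = extractor(os.path.join(foldername, file_))
            records.append({'Extracted': preprocess(raw), 'Label': label})
    return records
```

Each label folder would then need only one call, e.g. `extract_folder(path, {'.pdf': convert_pdf_to_txt, '.docx': docxpy.process}, text_preprocess)`, and the result can be passed straight to `pd.DataFrame` as before.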
Extracted Label
0 Kanumuru Deepak Reddy CAREER OBJECTIVE... React JS
1 HARIPRIYA BATTINA Experience as UI Developer... React JS
2 KAMALAKAR REDDY A Linked In https www li... React JS
3 Naveen Sadhu Title software developer ... React JS
4 FULLSTACK SOFTWARE DEVELOPER WEB DEVELOPER ... React JS
5 PRAGNYA PATTNAIK Expertise Ha... React JS
6 SARALA MADASU SARALA MADASU Sri geethi... React JS
7 Thirupathamma Balla SUMMARY year of... React JS
8 Maryala Vinay Reddy Professional Summary ... React JS
9 Ui Developer React JS Developer NAME KRISH... React JS
10 Ui Developer React JS Developer NAME KRISH... React JS
11 CURRICULUM VITAE Anjani Priyadarshini ... React JS
12 Kotani Durga Prasad Objective Aspiran... React JS
13 Venkatalakshmi Pedireddy Software Developer ... React JS
14 KAMBALA SAI SURENDRA SUMMAR... React JS
15 MAREEDU LOKESH BABU PROFESSIONAL OVERVIEW ... React JS
Extracted Label
0 Kanumuru Deepak Reddy CAREER OBJECTIVE... React JS
1 HARIPRIYA BATTINA Experience as UI Developer... React JS
2 KAMALAKAR REDDY A Linked In https www li... React JS
3 Naveen Sadhu Title software developer ... React JS
4 FULLSTACK SOFTWARE DEVELOPER WEB DEVELOPER ... React JS
5 PRAGNYA PATTNAIK Expertise Ha... React JS
6 SARALA MADASU SARALA MADASU Sri geethi... React JS
7 Thirupathamma Balla SUMMARY year of... React JS
8 Maryala Vinay Reddy Professional Summary ... React JS
9 Ui Developer React JS Developer NAME KRISH... React JS
10 Ui Developer React JS Developer NAME KRISH... React JS
11 CURRICULUM VITAE Anjani Priyadarshini ... React JS
12 Kotani Durga Prasad Objective Aspiran... React JS
13 Venkatalakshmi Pedireddy Software Developer ... React JS
14 KAMBALA SAI SURENDRA SUMMAR... React JS
15 MAREEDU LOKESH BABU PROFESSIONAL OVERVIEW ... React JS
16 MAREEDU LOKESH BABU PROFESSIONAL OVERVIEW ... React JS
Extracted Label
0 Kanumuru Deepak Reddy CAREER OBJECTIVE... React JS
1 HARIPRIYA BATTINA Experience as UI Developer... React JS
2 KAMALAKAR REDDY A Linked In https www li... React JS
3 Naveen Sadhu Title software developer ... React JS
4 FULLSTACK SOFTWARE DEVELOPER WEB DEVELOPER ... React JS
5 PRAGNYA PATTNAIK Expertise Ha... React JS
6 SARALA MADASU SARALA MADASU Sri geethi... React JS
7 Thirupathamma Balla SUMMARY year of... React JS
8 Maryala Vinay Reddy Professional Summary ... React JS
9 Ui Developer React JS Developer NAME KRISH... React JS
10 Ui Developer React JS Developer NAME KRISH... React JS
11 CURRICULUM VITAE Anjani Priyadarshini ... React JS
12 Kotani Durga Prasad Objective Aspiran... React JS
13 Venkatalakshmi Pedireddy Software Developer ... React JS
14 KAMBALA SAI SURENDRA SUMMAR... React JS
15 MAREEDU LOKESH BABU PROFESSIONAL OVERVIEW ... React JS
16 MAREEDU LOKESH BABU PROFESSIONAL OVERVIEW ... React JS
17 MD KHIZARUDDIN RAUF EXPERIENCE ... React JS
Extracted Label
0 Kanumuru Deepak Reddy CAREER OBJECTIVE... React JS
1 HARIPRIYA BATTINA Experience as UI Developer... React JS
2 KAMALAKAR REDDY A Linked In https www li... React JS
3 Naveen Sadhu Title software developer ... React JS
4 FULLSTACK SOFTWARE DEVELOPER WEB DEVELOPER ... React JS
5 PRAGNYA PATTNAIK Expertise Ha... React JS
6 SARALA MADASU SARALA MADASU Sri geethi... React JS
7 Thirupathamma Balla SUMMARY year of... React JS
8 Maryala Vinay Reddy Professional Summary ... React JS
9 Ui Developer React JS Developer NAME KRISH... React JS
10 Ui Developer React JS Developer NAME KRISH... React JS
11 CURRICULUM VITAE Anjani Priyadarshini ... React JS
12 Kotani Durga Prasad Objective Aspiran... React JS
13 Venkatalakshmi Pedireddy Software Developer ... React JS
14 KAMBALA SAI SURENDRA SUMMAR... React JS
15 MAREEDU LOKESH BABU PROFESSIONAL OVERVIEW ... React JS
16 MAREEDU LOKESH BABU PROFESSIONAL OVERVIEW ... React JS
17 MD KHIZARUDDIN RAUF EXPERIENCE ... React JS
18 Name M Prabakaran Title UI Develo... React JS
Extracted Label
0 Kanumuru Deepak Reddy CAREER OBJECTIVE... React JS
1 HARIPRIYA BATTINA Experience as UI Developer... React JS
2 KAMALAKAR REDDY A Linked In https www li... React JS
3 Naveen Sadhu Title software developer ... React JS
4 FULLSTACK SOFTWARE DEVELOPER WEB DEVELOPER ... React JS
5 PRAGNYA PATTNAIK Expertise Ha... React JS
6 SARALA MADASU SARALA MADASU Sri geethi... React JS
7 Thirupathamma Balla SUMMARY year of... React JS
8 Maryala Vinay Reddy Professional Summary ... React JS
9 Ui Developer React JS Developer NAME KRISH... React JS
10 Ui Developer React JS Developer NAME KRISH... React JS
11 CURRICULUM VITAE Anjani Priyadarshini ... React JS
12 Kotani Durga Prasad Objective Aspiran... React JS
13 Venkatalakshmi Pedireddy Software Developer ... React JS
14 KAMBALA SAI SURENDRA SUMMAR... React JS
15 MAREEDU LOKESH BABU PROFESSIONAL OVERVIEW ... React JS
16 MAREEDU LOKESH BABU PROFESSIONAL OVERVIEW ... React JS
17 MD KHIZARUDDIN RAUF EXPERIENCE ... React JS
18 Name M Prabakaran Title UI Develo... React JS
19 Pranish Sonone Career summary ... React JS
Extracted Label
0 Kanumuru Deepak Reddy CAREER OBJECTIVE... React JS
1 HARIPRIYA BATTINA Experience as UI Developer... React JS
2 KAMALAKAR REDDY A Linked In https www li... React JS
3 Naveen Sadhu Title software developer ... React JS
4 FULLSTACK SOFTWARE DEVELOPER WEB DEVELOPER ... React JS
5 PRAGNYA PATTNAIK Expertise Ha... React JS
6 SARALA MADASU SARALA MADASU Sri geethi... React JS
7 Thirupathamma Balla SUMMARY year of... React JS
8 Maryala Vinay Reddy Professional Summary ... React JS
9 Ui Developer React JS Developer NAME KRISH... React JS
10 Ui Developer React JS Developer NAME KRISH... React JS
11 CURRICULUM VITAE Anjani Priyadarshini ... React JS
12 Kotani Durga Prasad Objective Aspiran... React JS
13 Venkatalakshmi Pedireddy Software Developer ... React JS
14 KAMBALA SAI SURENDRA SUMMAR... React JS
15 MAREEDU LOKESH BABU PROFESSIONAL OVERVIEW ... React JS
16 MAREEDU LOKESH BABU PROFESSIONAL OVERVIEW ... React JS
17 MD KHIZARUDDIN RAUF EXPERIENCE ... React JS
18 Name M Prabakaran Title UI Develo... React JS
19 Pranish Sonone Career summary ... React JS
20 Ranga Gaganam Professional Summary... React JS
Extracted Label
0 Kanumuru Deepak Reddy CAREER OBJECTIVE... React JS
1 HARIPRIYA BATTINA Experience as UI Developer... React JS
2 KAMALAKAR REDDY A Linked In https www li... React JS
3 Naveen Sadhu Title software developer ... React JS
4 FULLSTACK SOFTWARE DEVELOPER WEB DEVELOPER ... React JS
5 PRAGNYA PATTNAIK Expertise Ha... React JS
6 SARALA MADASU SARALA MADASU Sri geethi... React JS
7 Thirupathamma Balla SUMMARY year of... React JS
8 Maryala Vinay Reddy Professional Summary ... React JS
9 Ui Developer React JS Developer NAME KRISH... React JS
10 Ui Developer React JS Developer NAME KRISH... React JS
11 CURRICULUM VITAE Anjani Priyadarshini ... React JS
12 Kotani Durga Prasad Objective Aspiran... React JS
13 Venkatalakshmi Pedireddy Software Developer ... React JS
14 KAMBALA SAI SURENDRA SUMMAR... React JS
15 MAREEDU LOKESH BABU PROFESSIONAL OVERVIEW ... React JS
16 MAREEDU LOKESH BABU PROFESSIONAL OVERVIEW ... React JS
17 MD KHIZARUDDIN RAUF EXPERIENCE ... React JS
18 Name M Prabakaran Title UI Develo... React JS
19 Pranish Sonone Career summary ... React JS
20 Ranga Gaganam Professional Summary... React JS
21 SHAIK ABDUL SHARUK years Experience in ... React JS
# Internship .doc to .docx conversion
# Create list of paths to .doc files
paths = glob('C:\\Users\\rahul\\Project 114\\Internship\\*.doc', recursive=True)
for path in paths:
save_as_docx(path)
# Internship resumes
extracted5 = []
# Based on the extension of file, extracting text
for foldername,subfolders,files in os.walk(r"C:/Users/rahul/Project 114/Internship"):
for file_ in files:
dict_ = {}
file_name,extension = splitext_(file_)
if extension == '.pdf':
converted = convert_pdf_to_txt(foldername +"/" + file_)
converted = text_preprocess(converted)
dict_['Extracted'] = converted
dict_['Label'] = foldername.split('/')[-1]
extracted5.append(dict_)
elif extension == '.docx':
doc = docxpy.process(foldername +'/'+ file_)
doc = text_preprocess(doc)
dict_['Extracted'] = doc
dict_['Label'] = foldername.split('/')[-1]
extracted5.append(dict_)
elif extension == '.ppt':
parsed = parser.from_file(foldername +'/'+ file_)
ppt = parsed["content"]
ppt = text_preprocess(ppt)
dict_['Extracted'] = ppt
dict_['Label'] = foldername.split('/')[-1]
extracted5.append(dict_)
df = pd.DataFrame(extracted5)
print(df)
df.to_csv('Internship.csv')
Extracted Label
0 Name Ravali P ... Internship
1 SUSOVAN BAG Seeking a challenging posi... Internship
df1= pd.read_csv('Peoplesoft_resumes.csv')
df2= pd.read_csv('SQL_Developer.csv')
df3= pd.read_csv('workday_resumes.csv')
df4= pd.read_csv('React_JS.csv')
df5= pd.read_csv('Internship.csv')
NLP_data = pd.concat([df1, df2, df3, df4, df5], axis=0)
NLP_data = NLP_data.drop('Unnamed: 0', axis=1)
NLP_data.reset_index(inplace=True, drop=True)
NLP_data
| Extracted | Label | |
|---|---|---|
| 0 | Anubhav Kumar Singh Core Competencies ... | Peoplesoft resumes |
| 1 | G Ananda Rayudu https www linked... | Peoplesoft resumes |
| 2 | PeopleSoft Database Administrator ... | Peoplesoft resumes |
| 3 | Classification Internal Classification Inte... | Peoplesoft resumes |
| 4 | Priyanka Ramadoss MountPleasant C... | Peoplesoft resumes |
| ... | ... | ... |
| 73 | Pranish Sonone Career summary ... | React JS |
| 74 | Ranga Gaganam Professional Summary... | React JS |
| 75 | SHAIK ABDUL SHARUK years Experience in ... | React JS |
| 76 | Name Ravali P ... | Internship |
| 77 | SUSOVAN BAG Seeking a challenging posi... | Internship |
78 rows × 2 columns
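The `Unnamed: 0` column dropped above is an artifact of saving with `to_csv` using the default `index=True`; the row index is written as an unnamed column and comes back with that placeholder name. A minimal reproduction (hypothetical one-row frame):

```python
import io

import pandas as pd

# saving with the default index=True writes the row index as an unnamed column
df = pd.DataFrame({"Extracted": ["some resume text"], "Label": ["React JS"]})
buf = io.StringIO()
df.to_csv(buf)            # same pattern as df.to_csv('Internship.csv') above
buf.seek(0)

back = pd.read_csv(buf)
print(back.columns.tolist())   # ['Unnamed: 0', 'Extracted', 'Label']
```

Writing `df.to_csv('Internship.csv', index=False)` would make the later `drop('Unnamed: 0', axis=1)` unnecessary.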
NLP_data.shape
(78, 2)
NLP_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 78 entries, 0 to 77
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   Extracted  78 non-null     object
 1   Label      78 non-null     object
dtypes: object(2)
memory usage: 1.3+ KB
#Count Values of Labels
NLP_data['Label'].value_counts()
React JS                           22
Peoplesoft resumes                 20
workday resumes                    20
SQL Developer Lightning insight    14
Internship                          2
Name: Label, dtype: int64
# Total number of words across all resumes
NLP_data.index = range(78)
NLP_data['Extracted'].apply(lambda x: len(x.split(' '))).sum()
104963
NLP_data.isna().sum()
Extracted    0
Label        0
dtype: int64
VARIABLE DESCRIPTIONS:
We have the following label counts:

| Label | Count |
| ---: | ---: |
|1. Peoplesoft resumes | 20 |
|2. SQL Developer Lightning insight | 14 |
|3. workday resumes | 20 |
|4. React JS | 22 |
|5. Internship | 2 |

Combining all resumes, we have 104,963 words.
NLP_data.describe()
| Extracted | Label | |
|---|---|---|
| count | 78 | 78 |
| unique | 78 | 5 |
| top | Anubhav Kumar Singh Core Competencies ... | React JS |
| freq | 1 | 22 |
In total there are 78 resumes and 5 unique labels.
#Convert Labels Name into Numerical Index
# Associate Labels names with numerical index and save it in new column LabelId
target_category = NLP_data['Label'].unique()
print(target_category)
['Peoplesoft resumes' 'SQL Developer Lightning insight' 'workday resumes' 'React JS' 'Internship']
NLP_data['LabelId'] = NLP_data['Label'].factorize()[0]
NLP_data.head()
| Extracted | Label | LabelId | |
|---|---|---|---|
| 0 | Anubhav Kumar Singh Core Competencies ... | Peoplesoft resumes | 0 |
| 1 | G Ananda Rayudu https www linked... | Peoplesoft resumes | 0 |
| 2 | PeopleSoft Database Administrator ... | Peoplesoft resumes | 0 |
| 3 | Classification Internal Classification Inte... | Peoplesoft resumes | 0 |
| 4 | Priyanka Ramadoss MountPleasant C... | Peoplesoft resumes | 0 |
# Create a new pandas dataframe "category", which only has unique Categories, also sorting this list in order of CategoryId values
category = NLP_data[['Label', 'LabelId']].drop_duplicates().sort_values('LabelId')
category
| Label | LabelId | |
|---|---|---|
| 0 | Peoplesoft resumes | 0 |
| 20 | SQL Developer Lightning insight | 1 |
| 34 | workday resumes | 2 |
| 54 | React JS | 3 |
| 76 | Internship | 4 |
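The `LabelId` column comes from `pd.factorize`, which assigns integer codes in order of first appearance; a minimal check on a toy series:

```python
import pandas as pd

labels = pd.Series(["Peoplesoft resumes", "SQL Developer Lightning insight",
                    "Peoplesoft resumes", "React JS"])
codes, uniques = pd.factorize(labels)

print(codes.tolist())   # [0, 1, 0, 2] -- codes follow first appearance
print(list(uniques))    # ['Peoplesoft resumes', 'SQL Developer Lightning insight', 'React JS']
```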
NLP_data['Label'].value_counts().iplot(kind='bar',bins=100,xTitle='Label',linecolor='black',yTitle='Number of Occurrences',title='Number Of Resume Under Each Label')
Inference: Class distribution: React JS has the most resumes while Internship has very few, so the dataset is slightly imbalanced. We will apply a data-balancing technique such as SMOTE while building the model.
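SMOTE (from `imblearn.over_sampling`) balances classes by synthesizing interpolated minority samples rather than duplicating rows. As a dependency-free sketch of the balancing idea only, here is plain random oversampling with scikit-learn's `resample` on toy data (the features and class counts are hypothetical):

```python
from collections import Counter

import numpy as np
from sklearn.utils import resample

rng = np.random.RandomState(42)
# toy features/labels with the same kind of skew as the resume dataset
X = rng.rand(36, 5)
y = np.array([0] * 22 + [1] * 12 + [2] * 2)

# oversample every minority class up to the majority class size
majority = max(Counter(y).values())
X_parts, y_parts = [], []
for cls in np.unique(y):
    X_cls = X[y == cls]
    X_up = resample(X_cls, replace=True, n_samples=majority, random_state=42)
    X_parts.append(X_up)
    y_parts.append(np.full(majority, cls))
X_bal = np.vstack(X_parts)
y_bal = np.concatenate(y_parts)

print(Counter(y_bal))   # every class now has 22 samples
```

With imbalanced-learn installed, the equivalent call is `X_bal, y_bal = SMOTE().fit_resample(X, y)` on the vectorized features (SMOTE needs enough minority samples for its k nearest neighbours).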
NLP_data.groupby('Label').LabelId.value_counts().plot(kind = "bar", color = ["pink", "orange", "red", "yellow", "blue"])
plt.xlabel("Label of data")
plt.title("Visualize number of resumes under each label")
plt.show()
labels = list(NLP_data['Label'].value_counts().index)[0:]
values = list(NLP_data['Label'].value_counts().values)[0:]
colors = ['lightblue','gray','#eee','#999', '#9f9f9f']
trace = go.Pie(labels=labels, values=values, hoverinfo='label+percent',
textinfo='value', name='Resume counts of different category',
marker=dict(colors=colors))
layout = dict(title = 'Distribution of Resume',
xaxis= dict(title= 'Resume',ticklen= 5,zeroline= False)
)
fig = dict(data = [trace], layout = layout)
iplot(fig)
plt.figure(figsize=(15,8))
plt.title('Percentage of Resume', fontsize=20)
NLP_data.Label.value_counts().plot(kind='pie', labels=['React_JS', 'Peoplesoft', 'Workday', 'Sql_Developer', 'Internship'],
wedgeprops=dict(width=.7), autopct="%1.1f%%", startangle= -20,
textprops={'fontsize': 15})
<AxesSubplot:title={'center':'Percentage of Resume'}, ylabel='Label'>
Inference: The pie chart shows the percentage of resumes under each category.
NLP_data['Extracted'].str.len().iplot(kind='hist',bins=100,xTitle='character count',linecolor='black',yTitle='count',title='Resume Text Character Count Distribution')
Inference: The histogram shows the number of characters in each resume. Lengths range from 2k to 18k characters, most commonly between 2.5k and 8k.
NLP_data['Extracted'].str.split().map(lambda x: len(x)).iplot(kind='hist',bins=100,xTitle='word count',linecolor='black',yTitle='count',title='Resume Text Word Count Distribution')
Inference: Word-level exploration shows that resumes range from 400 to 2,500 words, mostly falling between 500 and 700 words.
NLP_data['Extracted'].str.split().apply(lambda x : [len(i) for i in x]).map(lambda x: np.mean(x)).iplot(kind='hist',bins=100,xTitle='word count',linecolor='black',yTitle='count' ,title='Resume Text Average Word Count Distribution')
Inference: The average word length ranges between 3.5 and 6.5 characters, with 6 being the most common.
corpus=[]
new= NLP_data['Extracted'].str.split()
new=new.values.tolist()
corpus=[word for i in new for word in i]
from collections import defaultdict
dic=defaultdict(int)
for word in corpus:
if word in stop:
dic[word]+=1
top=sorted(dic.items(), key=lambda x:x[1],reverse=True)[:10]
x,y=zip(*top)
plt.bar(x,y)
<BarContainer object of 10 artists>
def get_top_n_words(corpus, n=None):
vec = CountVectorizer().fit(corpus)
bag_of_words = vec.transform(corpus)
sum_words = bag_of_words.sum(axis=0)
words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)
return words_freq[:n]
common_words = get_top_n_words(NLP_data['Extracted'], 10)
for word, freq in common_words:
print(word, freq)
df1 = pd.DataFrame(common_words, columns = ['ResumeText' , 'count'])
df1.groupby('ResumeText').sum()['count'].sort_values(ascending=True).iplot(kind='bar', yTitle='Count', linecolor='black',orientation='h', title='Top 10 words in resume before removing stop words')
and 2848
the 1392
in 1311
to 1128
of 1046
on 681
for 680
experience 590
peoplesoft 453
with 428
Inference: We can clearly see that stopwords such as "and", "the", and "in" dominate the resumes.
counter=Counter(corpus)
most=counter.most_common()
plt.figure(figsize=(9,10))
x, y= [], []
for word,count in most[:40]:
if (word not in stop):
x.append(word)
y.append(count)
sns.barplot(x=y,y=x)
plt.show()
Inference: The chart above shows which non-stopwords occur most frequently. At the word level, "peoplesoft" occurs most often, followed by "workday" and "sql", with "react js" the least frequent.
#The distribution of top Bigrams before removing stop words
def get_top_n_bigram(corpus, n=None):
vec = CountVectorizer(ngram_range=(2, 2)).fit(corpus)
bag_of_words = vec.transform(corpus)
sum_words = bag_of_words.sum(axis=0)
words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)
return words_freq[:n]
common_words = get_top_n_bigram(NLP_data['Extracted'], 10)
for word, freq in common_words:
print(word, freq)
df3 = pd.DataFrame(common_words, columns = ['ResumeText' , 'count'])
df3.groupby('ResumeText').sum()['count'].sort_values(ascending=True).iplot(kind='bar', yTitle='Count', linecolor='black',orientation='h', title='Top 10 bigrams in resume before cleaning data')
experience in 339
involved in 172
of the 144
worked on 131
sql server 117
process scheduler 107
to the 102
react js 100
in the 98
application server 93
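`ngram_range=(2, 2)` restricts the vocabulary to two-word sequences, so the counts above are over bigrams only; a minimal self-contained check of the same counting pattern on two toy documents:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["experience in sql server", "experience in react js"]
vec = CountVectorizer(ngram_range=(2, 2)).fit(docs)
bag = vec.transform(docs)
counts = bag.sum(axis=0)            # total count of each bigram over all docs

words_freq = sorted(((w, counts[0, i]) for w, i in vec.vocabulary_.items()),
                    key=lambda t: t[1], reverse=True)
print(words_freq[0])                # ('experience in', 2)
```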
def plot_top_ngrams_barchart(text, n=2):
stop=set(stopwords.words('english'))
new= text.str.split()
new=new.values.tolist()
corpus=[word for i in new for word in i]
def _get_top_ngram(corpus, n=None):
vec = CountVectorizer(ngram_range=(n, n)).fit(corpus)
bag_of_words = vec.transform(corpus)
sum_words = bag_of_words.sum(axis=0)
words_freq = [(word, sum_words[0, idx])
for word, idx in vec.vocabulary_.items()]
words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)
return words_freq[:20]
top_n_bigrams=_get_top_ngram(text,n)[:10]
x,y=map(list,zip(*top_n_bigrams))
sns.barplot(x=y,y=x)
plot_top_ngrams_barchart(NLP_data['Extracted'],2) #bigrams
Inference: We can see that "sql server" and "react js" frequently occur as fixed pairs. The bigram "experience in" is the most frequent, which suggests that most of the resumes received are from experienced candidates.
#The distribution of top trigrams after removing stop words
def get_top_n_trigram(corpus, n=None):
vec = CountVectorizer(ngram_range=(3, 3)).fit(corpus)
bag_of_words = vec.transform(corpus)
sum_words = bag_of_words.sum(axis=0)
words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)
return words_freq[:n]
common_words = get_top_n_trigram(NLP_data['Extracted'], 10)
for word, freq in common_words:
print(word, freq)
df5 = pd.DataFrame(common_words, columns = ['ResumeText' , 'count'])
df5.groupby('ResumeText').sum()['count'].sort_values(ascending=True).iplot(
kind='bar', yTitle='Count', linecolor='black',orientation='h', title='Top 10 trigrams in resume Before cleaning data')
hands on experience 50
on experience in 42
years of experience 41
to till date 41
of application server 40
application server domains 37
process scheduler servers 36
and process scheduler 36
day to day 35
of experience in 35
plot_top_ngrams_barchart(NLP_data['Extracted'],3) #trigrams
Inference: We can see that many of these trigrams are combinations involving the word "experience".
stop = set(stopwords.words('english'))
Peoplesoft = NLP_data[NLP_data['LabelId'] == 0]
Peoplesoft = Peoplesoft['Extracted']
SQL_Developer = NLP_data[NLP_data['LabelId'] == 1]
SQL_Developer = SQL_Developer['Extracted']
Workday = NLP_data[NLP_data['LabelId'] == 2]
Workday = Workday['Extracted']
React_JS = NLP_data[NLP_data['LabelId'] == 3]
React_JS = React_JS['Extracted']
Internship = NLP_data[NLP_data['LabelId'] == 4]
Internship = Internship['Extracted']
def wordcloud_draw(dataset, color = 'white'):
words = ' '.join(dataset)
cleaned_word = ' '.join([word for word in words.split()
if (word != 'news' and word != 'text')])
wordcloud = WordCloud(stopwords = stop,
background_color = color,
width = 2500, height = 2500).generate(cleaned_word)
plt.figure(1, figsize = (10,7))
plt.imshow(wordcloud)
plt.axis("off")
plt.show()
print("Peoplesoft related words:")
wordcloud_draw(Peoplesoft, 'white')
print("SQL_Developer related words:")
wordcloud_draw(SQL_Developer, 'white')
print("Workday related words:")
wordcloud_draw(Workday, 'white')
print("React_JS related words:")
wordcloud_draw(React_JS, 'white')
print("Internship related words:")
wordcloud_draw(Internship, 'white')
Peoplesoft related words:
SQL_Developer related words:
Workday related words:
React_JS related words:
Internship related words:
Inference: At a glance, the word clouds show the most relevant words in each category, which are highly indicative for classifying the resumes. Notably, the internship resumes do not mention specific software skills.
# WORD-COUNT
NLP_data['word_count'] = NLP_data['Extracted'].apply(lambda x: len(str(x).split()))
print("Peoplesoft", NLP_data[NLP_data['LabelId']==0]['word_count'].mean())
print("SQL Developer", NLP_data[NLP_data['LabelId']==1]['word_count'].mean())
print("Workday", NLP_data[NLP_data['LabelId']==2]['word_count'].mean())
print("React JS", NLP_data[NLP_data['LabelId']==3]['word_count'].mean())
print("Internship", NLP_data[NLP_data['LabelId']==4]['word_count'].mean())
Peoplesoft 925.55
SQL Developer 623.9285714285714
Workday 849.45
React JS 430.8636363636364
Internship 495.5
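The five `print` calls above can be collapsed into a single `groupby`, which keeps each label and its mean together; a toy sketch with hypothetical counts:

```python
import pandas as pd

df = pd.DataFrame({
    "Label": ["Peoplesoft", "Peoplesoft", "React JS"],
    "word_count": [900, 951, 430],
})

# mean word count per label in one call
means = df.groupby("Label")["word_count"].mean()
print(means.to_dict())   # {'Peoplesoft': 925.5, 'React JS': 430.0}
```

In the notebook this would be `NLP_data.groupby('Label')['word_count'].mean()`.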
# PLOTTING WORD-COUNT
fig,axs=plt.subplots(3,2,figsize=(10,10))
train_words=NLP_data[NLP_data['LabelId']==0]['word_count']
axs[0,0].hist(train_words,color='red')
axs[0,0].set_title('Peoplesoft')
train_words=NLP_data[NLP_data['LabelId']==1]['word_count']
axs[0,1].hist(train_words,color='green')
axs[0,1].set_title('SQL_Developer')
train_words=NLP_data[NLP_data['LabelId']==2]['word_count']
axs[1,0].hist(train_words,color='lightblue')
axs[1,0].set_title('Workday')
train_words=NLP_data[NLP_data['LabelId']==3]['word_count']
axs[1,1].hist(train_words,color='lightblue')
axs[1,1].set_title('React_JS')
train_words=NLP_data[NLP_data['LabelId']==4]['word_count']
axs[2,0].hist(train_words,color='lightblue')
axs[2,0].set_title('Internship')
fig.suptitle('Resume Word Count Under Each Label')
plt.show()
Inference: Visualizing the word count per resume, Peoplesoft resumes have the highest word counts of all the labels.
# CHARACTER-COUNT
NLP_data['char_count'] = NLP_data['Extracted'].apply(lambda x: len(str(x)))
print("Peoplesoft", NLP_data[NLP_data['LabelId']==0]['char_count'].mean())
print("SQL_Developer", NLP_data[NLP_data['LabelId']==1]['char_count'].mean())
print("Workday", NLP_data[NLP_data['LabelId']==2]['char_count'].mean())
print("React_JS", NLP_data[NLP_data['LabelId']==3]['char_count'].mean())
print("Internship", NLP_data[NLP_data['LabelId']==4]['char_count'].mean())
Peoplesoft 7402.05
SQL_Developer 4644.5
Workday 6558.35
React_JS 3412.409090909091
Internship 4038.0
# PLOTTING CHARACTER-COUNT
fig,axs=plt.subplots(3,2,figsize=(10,10))
train_words=NLP_data[NLP_data['LabelId']==0]['char_count']
axs[0,0].hist(train_words,color='red')
axs[0,0].set_title('Peoplesoft')
train_words=NLP_data[NLP_data['LabelId']==1]['char_count']
axs[0,1].hist(train_words,color='green')
axs[0,1].set_title('SQL_Developer')
train_words=NLP_data[NLP_data['LabelId']==2]['char_count']
axs[1,0].hist(train_words,color='lightblue')
axs[1,0].set_title('Workday')
train_words=NLP_data[NLP_data['LabelId']==3]['char_count']
axs[1,1].hist(train_words,color='lightblue')
axs[1,1].set_title('React_JS')
train_words=NLP_data[NLP_data['LabelId']==4]['char_count']
axs[2,0].hist(train_words,color='lightblue')
axs[2,0].set_title('Internship')
fig.suptitle('Resume Character Count Under Each Label')
plt.show()
Inference: Visualizing the character count across labels, Peoplesoft resumes have the highest average at 7,402 characters, compared to the averages of all other resume categories.
#convert to lowercase, strip, and remove URLs, HTML tags, punctuation, digits and emoji
def preprocess(text):
    text = text.lower().strip()
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # remove URLs before punctuation is stripped, or the pattern never matches
    text = re.sub(r'<.*?>', '', text)                  # remove HTML tags
    text = re.sub(r'\[[0-9]*\]', ' ', text)            # remove bracketed references
    text = re.sub('[%s]' % re.escape(string.punctuation), ' ', text)  # punctuation -> spaces
    text = re.sub(r'[^\w\s]', '', text)                # any remaining non-word characters
    text = re.sub(r'\d', ' ', text)                    # remove digits
    text = re.sub("["
                  u"\U0001F600-\U0001F64F"  # emoticons
                  u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                  u"\U0001F680-\U0001F6FF"  # transport & map symbols
                  u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                  u"\U00002702-\U000027B0"
                  u"\U000024C2-\U0001F251"
                  "]+", '', text, flags=re.UNICODE)    # remove emoji
    text = re.sub(r'\s+', ' ', text)                   # collapse whitespace
    return text
# STOPWORD REMOVAL
def stopword(string):
a= [i for i in string.split() if i not in stopwords.words('english')]
return ' '.join(a)
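Note that `stopword()` re-evaluates `stopwords.words('english')` (a list) for every call; converting the list to a `set` once makes each membership test O(1). A self-contained sketch using a tiny stand-in list (the real NLTK English list has about 180 entries):

```python
# tiny stand-in for set(stopwords.words('english'))
STOP = {"the", "in", "and", "of", "to", "a"}

def remove_stopwords(text):
    # set membership is O(1) per token, versus a list scan per token
    return " ".join(w for w in text.split() if w not in STOP)

print(remove_stopwords("experience in the sql server"))   # experience sql server
```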
#LEMMATIZATION
# Initialize the lemmatizer
wl = WordNetLemmatizer()
# This is a helper function to map NLTK position tags
def get_wordnet_pos(tag):
if tag.startswith('J'):
return wordnet.ADJ
elif tag.startswith('V'):
return wordnet.VERB
elif tag.startswith('N'):
return wordnet.NOUN
elif tag.startswith('R'):
return wordnet.ADV
else:
return wordnet.NOUN
'''
porter=PorterStemmer()
def stemString(string):
token_words=word_tokenize(string)
token_words
stem_string=[]
for word in token_words:
stem_string.append(porter.stem(word))
stem_string.append(" ")
return "".join(stem_string)
'''
# Tokenize the sentence
def lemmatizer(string):
word_pos_tags = nltk.pos_tag(word_tokenize(string)) # Get position tags
a=[wl.lemmatize(tag[0], get_wordnet_pos(tag[1])) for idx, tag in enumerate(word_pos_tags)] # Map the position tag and lemmatize the word/token
return " ".join(a)
def finalpreprocess(string):
return lemmatizer(stopword(preprocess(string)))
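A quick self-contained check of the regex cleaning steps (URLs removed before punctuation so the URL pattern can still match; the stopword and lemmatization steps are skipped here since they require NLTK data downloads):

```python
import re
import string

def clean(text):
    text = text.lower().strip()
    text = re.sub(r'https?://\S+|www\.\S+', '', text)   # URLs
    text = re.sub(r'<.*?>', '', text)                   # HTML tags
    text = re.sub('[%s]' % re.escape(string.punctuation), ' ', text)
    text = re.sub(r'\d', ' ', text)                     # digits
    return re.sub(r'\s+', ' ', text).strip()            # collapse whitespace

print(clean("Visit <b>https://example.com</b> for 10+ SQL tips!"))   # visit for sql tips
```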
#Final pre-processing
NLP_data['clean_text'] = NLP_data['Extracted'].apply(lambda x: finalpreprocess(x))
NLP_data.head()
| Extracted | Label | LabelId | word_count | char_count | clean_text | |
|---|---|---|---|---|---|---|
| 0 | Anubhav Kumar Singh Core Competencies ... | Peoplesoft resumes | 0 | 973 | 8010 | anubhav kumar singh core competency script she... |
| 1 | G Ananda Rayudu https www linked... | Peoplesoft resumes | 0 | 924 | 8318 | g ananda rayudu http www linkedin com anandgud... |
| 2 | PeopleSoft Database Administrator ... | Peoplesoft resumes | 0 | 780 | 6900 | peoplesoft database administrator gangareddy p... |
| 3 | Classification Internal Classification Inte... | Peoplesoft resumes | 0 | 593 | 4918 | classification internal classification interna... |
| 4 | Priyanka Ramadoss MountPleasant C... | Peoplesoft resumes | 0 | 631 | 5196 | priyanka ramadoss mountpleasant coonoor nilgir... |
NLP_data['resume_len'] = NLP_data['Extracted'].astype(str).apply(len)
#Polarity shows the sentiment of a piece of text. It counts the negative and positive words and determines the polarity.
#The value ranges from -1 to 1 where -1 represents the negative sentiment,
#0 represents neutral and 1 represent positive sentiment.
NLP_data['polarity'] = NLP_data['clean_text'].map(lambda text: TextBlob(text).sentiment.polarity)
NLP_data.head()
| Extracted | Label | LabelId | word_count | char_count | clean_text | resume_len | polarity | |
|---|---|---|---|---|---|---|---|---|
| 0 | Anubhav Kumar Singh Core Competencies ... | Peoplesoft resumes | 0 | 973 | 8010 | anubhav kumar singh core competency script she... | 8010 | 0.057173 |
| 1 | G Ananda Rayudu https www linked... | Peoplesoft resumes | 0 | 924 | 8318 | g ananda rayudu http www linkedin com anandgud... | 8318 | 0.227980 |
| 2 | PeopleSoft Database Administrator ... | Peoplesoft resumes | 0 | 780 | 6900 | peoplesoft database administrator gangareddy p... | 6900 | 0.228829 |
| 3 | Classification Internal Classification Inte... | Peoplesoft resumes | 0 | 593 | 4918 | classification internal classification interna... | 4918 | 0.021143 |
| 4 | Priyanka Ramadoss MountPleasant C... | Peoplesoft resumes | 0 | 631 | 5196 | priyanka ramadoss mountpleasant coonoor nilgir... | 5196 | 0.064713 |
NLP_data['polarity'].iplot(kind='hist',bins=50,xTitle='polarity',linecolor='black',yTitle='count',title='Sentiment Polarity Distribution')
Inference: The vast majority of the sentiment polarity scores are greater than zero, meaning most resumes read as positive.
NLP_data['resume_len'].iplot(kind='hist',bins=50,xTitle='resume length',linecolor='black',yTitle='count',title='Resume Text Length Distribution')
Inference: The resume length ranges between 2k and 18k characters, with the most common length between 3k and 5k.
#The distribution of top unigrams after removing stop words
def get_top_n_words(corpus, n=None):
vec = CountVectorizer(stop_words = 'english').fit(corpus)
bag_of_words = vec.transform(corpus)
sum_words = bag_of_words.sum(axis=0)
words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)
return words_freq[:n]
common_words = get_top_n_words(NLP_data['clean_text'], 10)
for word, freq in common_words:
print(word, freq)
df2 = pd.DataFrame(common_words, columns = ['ResumeText' , 'count'])
df2.groupby('ResumeText').sum()['count'].sort_values(ascending=True).iplot(kind='bar', yTitle='Count', linecolor='black',orientation='h', title='Top 10 words in resume after cleaning data')
experience 631
application 534
report 518
server 512
work 498
use 494
peoplesoft 453
workday 410
project 384
create 370
#The distribution of top Bigrams after removing stop words
def get_top_n_bigram(corpus, n=None):
vec = CountVectorizer(ngram_range=(2, 2)).fit(corpus)
bag_of_words = vec.transform(corpus)
sum_words = bag_of_words.sum(axis=0)
words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)
return words_freq[:n]
common_words = get_top_n_bigram(NLP_data['clean_text'], 10)
for word, freq in common_words:
print(word, freq)
df3 = pd.DataFrame(common_words, columns = ['ResumeText' , 'count'])
df3.groupby('ResumeText').sum()['count'].sort_values(ascending=True).iplot(kind='bar', yTitle='Count', linecolor='black',orientation='h', title='Top 10 bigrams in resume after cleaning data')
application server 124
sql server 117
process scheduler 109
web server 94
business process 84
people tool 80
custom report 76
core connector 73
workday hcm 67
workday studio 66
#The distribution of top trigrams after removing stop words
def get_top_n_trigram(corpus, n=None):
vec = CountVectorizer(ngram_range=(3, 3)).fit(corpus)
bag_of_words = vec.transform(corpus)
sum_words = bag_of_words.sum(axis=0)
words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)
return words_freq[:n]
common_words = get_top_n_trigram(NLP_data['clean_text'], 10)
for word, freq in common_words:
    print(word, freq)
df5 = pd.DataFrame(common_words, columns=['ResumeText', 'count'])
df5.groupby('ResumeText').sum()['count'].sort_values(ascending=True).iplot(
    kind='bar', yTitle='Count', linecolor='black', orientation='h',
    title='Top 10 trigrams in resume after cleaning data')
process scheduler server 56
server web server 47
server process scheduler 41
application server web 37
application server domains 36
server domains process 35
domains process scheduler 35
operate system windows 28
peoplesoft internet architecture 26
summary year experience 26
# Join all cleaned resumes into one string; str() on a Series would include the index and truncate the text
blob = TextBlob(' '.join(NLP_data['clean_text']))
pos_df = pd.DataFrame(blob.tags, columns=['word', 'pos'])
pos_df = pos_df.pos.value_counts()[:20]
pos_df.iplot(
kind='bar',
xTitle='POS',
yTitle='count',
title='Top Part-of-speech tagging for resume corpus')
# Box plot of sentiment polarity per resume category
categories = ['Peoplesoft resumes', 'SQL Developer Lightning insight',
              'workday resumes', 'React JS', 'Internship']
colors = ['rgb(214, 12, 140)', 'rgb(0, 128, 128)', 'rgb(10, 140, 208)',
          'rgb(12, 102, 14)', 'rgb(10, 0, 100)']
data = [go.Box(y=NLP_data.loc[NLP_data['Label'] == cat, 'polarity'],
               name=cat, marker=dict(color=col))
        for cat, col in zip(categories, colors)]
layout = go.Layout(
title = "Sentiment Polarity Boxplot of Different Resume Category"
)
fig = go.Figure(data=data,layout=layout)
iplot(fig, filename = "Sentiment Polarity Boxplot of Resume")
Inference: The Internship category achieved the highest sentiment polarity scores; the other four categories mostly lie between 0 and 0.2. The workday resumes category has the lowest median polarity score.
# Box plot of resume length per resume category
categories = ['Peoplesoft resumes', 'SQL Developer Lightning insight',
              'workday resumes', 'React JS', 'Internship']
colors = ['rgb(214, 12, 140)', 'rgb(0, 128, 128)', 'rgb(10, 140, 208)',
          'rgb(12, 102, 14)', 'rgb(10, 0, 100)']
data = [go.Box(y=NLP_data.loc[NLP_data['Label'] == cat, 'resume_len'],
               name=cat, marker=dict(color=col))
        for cat, col in zip(categories, colors)]
layout = go.Layout(
title = "Resume Length Boxplot of Different Resume Category"
)
fig = go.Figure(data=data,layout=layout)
iplot(fig, filename = "Resume Length Boxplot of Resume")
Inference: The median resume lengths of the React JS and Internship categories are noticeably lower than those of the other resume categories.
Topic modeling is the process of using unsupervised learning techniques to extract the main topics that occur in a collection of documents.
def preprocess_news(df):
    corpus = []
    lem = WordNetLemmatizer()
    for news in df['Extracted']:
        words = [w for w in word_tokenize(news) if w not in stop]  # 'stop' is the stop-word list defined earlier
        words = [lem.lemmatize(w) for w in words if len(w) > 2]
        corpus.append(words)
    return corpus

corpus = preprocess_news(NLP_data)
#Let’s create the bag of words model using gensim
dic=gensim.corpora.Dictionary(corpus)
bow_corpus = [dic.doc2bow(doc) for doc in corpus]
lda_model = gensim.models.LdaMulticore(bow_corpus,
num_topics = 5,
id2word = dic,
passes = 10,
workers = 2)
lda_model.show_topics()
[(0, '0.022*"SQL" + 0.009*"Server" + 0.008*"data" + 0.008*"using" + 0.007*"Experience" + 0.007*"Project" + 0.006*"report" + 0.005*"query" + 0.005*"experience" + 0.005*"Developer"'),
 (1, '0.009*"knowledge" + 0.009*"React" + 0.008*"HTML" + 0.008*"using" + 0.007*"CSS" + 0.007*"Good" + 0.006*"JavaScript" + 0.005*"application" + 0.005*"web" + 0.004*"Experience"'),
 (2, '0.029*"PeopleSoft" + 0.015*"Application" + 0.015*"Experience" + 0.014*"server" + 0.012*"Server" + 0.009*"Web" + 0.009*"using" + 0.008*"Database" + 0.008*"Process" + 0.008*"Oracle"'),
 (3, '0.014*"Application" + 0.011*"PeopleSoft" + 0.008*"People" + 0.007*"Project" + 0.007*"FSCM" + 0.006*"application" + 0.006*"Tools" + 0.005*"requirement" + 0.005*"Environment" + 0.005*"data"'),
 (4, '0.024*"Workday" + 0.011*"integration" + 0.010*"using" + 0.009*"EIB" + 0.009*"report" + 0.008*"business" + 0.008*"HCM" + 0.007*"experience" + 0.007*"Core" + 0.007*"requirement"')]
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim_models.prepare(lda_model, bow_corpus, dic)
vis
Inference: Many of the extracted topics and their top words are associated with PeopleSoft and Workday, matching the dominant resume categories in the dataset.
correlation = NLP_data[['Label','polarity', 'resume_len', 'word_count', 'char_count']].corr()
mask = np.zeros_like(correlation, dtype=bool)  # np.bool is deprecated since NumPy 1.20
mask[np.triu_indices_from(mask)] = True
plt.figure(figsize=(10,5))
plt.xticks()
plt.yticks()
sns.heatmap(correlation, cmap='coolwarm', annot=True, linewidths=10, vmin=-1.5, mask=mask)
Inference: The heatmap shows the relationships between the different variables. Sentiment (polarity) correlates negatively with the other features, which suggests that the wording used in resumes leans negative.
Named entity recognition is an information extraction method in which entities that are present in the text are classified into predefined entity types like “Person”,” Place”,” Organization”, etc. By using NER we can get great insights about the types of entities present in the given text dataset.
def ner(text):
    doc = nlp(text)
    return [X.label_ for X in doc.ents]

ent = NLP_data['clean_text'].apply(lambda x: ner(x))
ent = [x for sub in ent for x in sub]
counter = Counter(ent)
count = counter.most_common()
x, y = map(list, zip(*count))
sns.barplot(x=y, y=x)
def ner(text, ent="ORG"):
    doc = nlp(text)
    return [X.text for X in doc.ents if X.label_ == ent]

org = NLP_data['clean_text'].apply(lambda x: ner(x))
org = [i for x in org for i in x]
counter = Counter(org)
x, y = map(list, zip(*counter.most_common(10)))
sns.barplot(x=y, y=x)  # keyword arguments; positional x/y is deprecated in seaborn
def ner(text, ent="GPE"):
    doc = nlp(text)
    return [X.text for X in doc.ents if X.label_ == ent]

org = NLP_data['clean_text'].apply(lambda x: ner(x))
org = [i for x in org for i in x]
counter = Counter(org)
x, y = map(list, zip(*counter.most_common(10)))
sns.barplot(x=y, y=x)  # keyword arguments; positional x/y is deprecated in seaborn
Inference: We used spaCy to derive the named entities. The model's classification is far from perfect: it cannot extract skills from the resumes directly, but it does surface useful information, such as the organisations a person has worked for, the countries where they worked, and the languages they know. Extracting skills would require further exploration.
We discussed and implemented various exploratory data analysis methods for text data.
The process of converting text data into numerical vectors is called vectorization or, in the NLP world, word embedding. Bag-of-Words (BoW) and word embeddings (with Word2Vec and GloVe) are two well-known approaches for converting text data to numerical form.
Bag of Words
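As a minimal illustration of the Bag-of-Words idea (a pure-Python sketch with toy sentences invented for the example, not the project's data), each document becomes a vector of word counts over a shared vocabulary:

```python
from collections import Counter

docs = ["experience with sql server",
        "sql developer with strong sql experience"]

# Shared vocabulary: every distinct word across all documents
vocab = sorted({w for d in docs for w in d.split()})

def bow_vector(doc):
    # Count how often each vocabulary word appears in this document
    counts = Counter(doc.split())
    return [counts[w] for w in vocab]

for d in docs:
    print(bow_vector(d))
```

`CountVectorizer` (used below) does the same thing at scale, with tokenization, n-grams, and stop-word removal built in.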
Word2Vec
One of the major drawbacks of Bag-of-Words techniques is that the vectors cannot capture the meaning of, or relations between, words. Word2Vec is one of the most popular techniques for learning word embeddings with a shallow neural network; it can capture the context of a word in a document, semantic and syntactic similarity, relations with other words, and so on.
We can use any of these approaches to convert our text data to numerical form which will be used to build the classification model.
Glove
GloVe stands for Global Vectors for word representation. It is an unsupervised learning algorithm, developed by researchers at Stanford University, for obtaining vector representations of words by aggregating global word-word co-occurrence statistics from a corpus. The resulting representations showcase interesting linear substructures of the word vector space.
Difference between GloVe and Word2Vec
The GloVe model leverages global word-to-word co-occurrence counts over the entire corpus, whereas Word2Vec leverages co-occurrence within a local context (neighbouring words). In practice, however, both models give similar results for many tasks.
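To see why embeddings capture relations that raw counts cannot, here is a toy sketch comparing cosine similarities of word vectors (the 4-dimensional vectors are invented for illustration; real GloVe/Word2Vec vectors have 100-300 dimensions):

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity: 1 for identical directions, 0 for orthogonal vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings, invented for this example
emb = {
    "sql":      np.array([0.9, 0.1, 0.0, 0.2]),
    "database": np.array([0.8, 0.2, 0.1, 0.3]),
    "react":    np.array([0.1, 0.9, 0.8, 0.0]),
}

print(cosine(emb["sql"], emb["database"]))  # high: related terms point the same way
print(cosine(emb["sql"], emb["react"]))     # low: unrelated terms
```

In a Bag-of-Words representation, "sql" and "database" would be entirely separate dimensions with zero similarity; embeddings place related words close together.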
# CountVectorizer
def cv(data):
    count_vectorizer = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}',
                                       ngram_range=(1, 3), stop_words='english')
    emb = count_vectorizer.fit_transform(data)
    return emb, count_vectorizer
#Term Frequency-Inverse Document Frequencies (tf-Idf)
def tfidf(data):
    tfidf_vectorizer = TfidfVectorizer(min_df=3, max_features=None,
                                       strip_accents='unicode', analyzer='word',
                                       token_pattern=r'\w{1,}', ngram_range=(1, 3),
                                       use_idf=1, smooth_idf=1, sublinear_tf=1,
                                       stop_words='english')
    train = tfidf_vectorizer.fit_transform(data)
    return train, tfidf_vectorizer
# Create the Word2Vec model.
# The input must be a list of token lists: one list of words per document,
# so its length equals the number of documents in the dataset.
NLP_data['clean_text_tok'] = [nltk.word_tokenize(i) for i in NLP_data['clean_text']]  # tokenize each preprocessed resume
model = Word2Vec(NLP_data['clean_text_tok'], min_count=1)  # min_count=1 keeps every word that appears at least once;
                                                           # min_count=2 would drop words seen fewer than 2 times
w2v = dict(zip(model.wv.index_to_key, model.wv.vectors))   # map each word to its learned vector
# Converts a tokenized sentence to a single vector by averaging its word vectors
class MeanEmbeddingVectorizer(object):
    def __init__(self, word2vec):
        self.word2vec = word2vec
        # if a text is empty we return a vector of zeros
        # with the same dimensionality as all the other vectors
        self.dim = len(next(iter(word2vec.values())))

    def fit(self, X, y):
        return self

    def transform(self, X):
        return np.array([
            np.mean([self.word2vec[w] for w in words if w in self.word2vec]
                    or [np.zeros(self.dim)], axis=0)
            for words in X
        ])
GloVe Features
#GloVe Features
from keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer()
tokenizer.fit_on_texts(NLP_data.Extracted)
word_index = tokenizer.word_index
vocab_size = len(tokenizer.word_index) + 1
print("Vocabulary Size :", vocab_size)
Vocabulary Size : 4660
The tokenizer creates a token for every word in the data corpus and maps each one to an index using a dictionary.
word_index contains the index assigned to each word.
vocab_size is the total number of distinct words in the corpus (plus one, since index 0 is reserved).
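A rough stdlib sketch of what the Keras Tokenizer does under the hood (simplified for illustration; the real implementation also filters punctuation and handles ties by frequency):

```python
from collections import Counter

# Toy corpus invented for the example
texts = ["peoplesoft application server", "workday integration workday hcm"]

# Count word frequencies across the corpus, then assign indices 1..N:
# the most frequent word gets index 1; index 0 is reserved for padding
freq = Counter(w for t in texts for w in t.lower().split())
word_index = {w: i + 1 for i, (w, _) in enumerate(freq.most_common())}
vocab_size = len(word_index) + 1

print(word_index)
print("Vocabulary Size :", vocab_size)
```

With this mapping, each document can be turned into a sequence of integer indices, which is the form the embedding layer below expects.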
GLOVE_EMB = 'C:/Users/rahul/Project 114/glove.6B.300d.txt'
embeddings_index = {}
f = open(GLOVE_EMB, encoding="utf-8", errors='ignore')
for line in f:
    # Each line: the word followed by its embedding values
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()
print('Found %s word vectors.' %len(embeddings_index))
Found 400000 word vectors.
# this function creates a normalized vector for the whole sentence
stop_words = stopwords.words('english')

def sent2vec(s):
    words = str(s).lower()
    words = word_tokenize(words)
    words = [w for w in words if w not in stop_words]
    words = [w for w in words if w.isalpha()]
    M = []
    for w in words:
        try:
            M.append(embeddings_index[w])
        except KeyError:  # skip out-of-vocabulary words
            continue
    M = np.array(M)
    v = M.sum(axis=0)
    if type(v) != np.ndarray:
        return np.zeros(300)
    return v / np.sqrt((v ** 2).sum())
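The normalization step at the end can be seen on a toy example (invented 3-dimensional vectors; real GloVe vectors are 300-dimensional): the word vectors are summed and the result is scaled to unit L2 norm, so resumes of different lengths stay comparable:

```python
import numpy as np

# Hypothetical tiny embedding table, invented for this example
emb = {"sql":    np.array([1.0, 0.0, 2.0]),
       "server": np.array([0.0, 2.0, 2.0])}

v = emb["sql"] + emb["server"]        # sum of the word vectors: [1, 2, 4]
v_norm = v / np.sqrt((v ** 2).sum())  # scale to unit L2 norm

print(np.linalg.norm(v_norm))         # 1.0: the sentence vector lies on the unit sphere
```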
NLP_data
| | Extracted | Label | LabelId | word_count | char_count | clean_text | resume_len | polarity | clean_text_tok |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Anubhav Kumar Singh Core Competencies ... | Peoplesoft resumes | 0 | 973 | 8010 | anubhav kumar singh core competency script she... | 8010 | 0.057173 | [anubhav, kumar, singh, core, competency, scri... |
| 1 | G Ananda Rayudu https www linked... | Peoplesoft resumes | 0 | 924 | 8318 | g ananda rayudu http www linkedin com anandgud... | 8318 | 0.227980 | [g, ananda, rayudu, http, www, linkedin, com, ... |
| 2 | PeopleSoft Database Administrator ... | Peoplesoft resumes | 0 | 780 | 6900 | peoplesoft database administrator gangareddy p... | 6900 | 0.228829 | [peoplesoft, database, administrator, gangared... |
| 3 | Classification Internal Classification Inte... | Peoplesoft resumes | 0 | 593 | 4918 | classification internal classification interna... | 4918 | 0.021143 | [classification, internal, classification, int... |
| 4 | Priyanka Ramadoss MountPleasant C... | Peoplesoft resumes | 0 | 631 | 5196 | priyanka ramadoss mountpleasant coonoor nilgir... | 5196 | 0.064713 | [priyanka, ramadoss, mountpleasant, coonoor, n... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 73 | Pranish Sonone Career summary ... | React JS | 3 | 226 | 1773 | pranish sonone career summary experience year ... | 1773 | 0.225388 | [pranish, sonone, career, summary, experience,... |
| 74 | Ranga Gaganam Professional Summary... | React JS | 3 | 368 | 3299 | ranga gaganam professional summary professiona... | 3299 | 0.258242 | [ranga, gaganam, professional, summary, profes... |
| 75 | SHAIK ABDUL SHARUK years Experience in ... | React JS | 3 | 372 | 3111 | shaik abdul sharuk year experience wipro caree... | 3111 | 0.164077 | [shaik, abdul, sharuk, year, experience, wipro... |
| 76 | Name Ravali P ... | Internship | 4 | 793 | 6175 | name ravali p curriculum vitae specialization ... | 6175 | 0.434170 | [name, ravali, p, curriculum, vitae, specializ... |
| 77 | SUSOVAN BAG Seeking a challenging posi... | Internship | 4 | 198 | 1901 | susovan bag seek challenging position field sc... | 1901 | 0.311429 | [susovan, bag, seek, challenging, position, fi... |
78 rows × 9 columns
#SPLITTING THE TRAINING DATASET INTO TRAIN AND TEST
X_train, X_test, y_train, y_test = train_test_split(NLP_data["clean_text"],NLP_data["LabelId"],test_size=0.3, random_state=30,shuffle=True)
#CountVectorizer
X_train_counts, count_vectorizer = cv(X_train)
X_test_counts = count_vectorizer.transform(X_test)
#Tf-Idf
X_train_vectors_tfidf, tfidf_vectorizer = tfidf(X_train)
X_test_vectors_tfidf = tfidf_vectorizer.transform(X_test)
#Word2Vec
# Word2Vec runs on tokenized sentences
X_train_tok= [nltk.word_tokenize(i) for i in X_train]
X_test_tok= [nltk.word_tokenize(i) for i in X_test]
# converting text to numerical data using Word2Vec
# Fit and transform
modelw = MeanEmbeddingVectorizer(w2v)
X_train_vectors_w2v = modelw.transform(X_train_tok)
X_test_vectors_w2v = modelw.transform(X_test_tok)
#Glove
# create sentence vectors using the above function for training and validation set
xtrain_glove = [sent2vec(x) for x in tqdm(X_train)]
xtest_glove = [sent2vec(x) for x in tqdm(X_test)]
xtrain_glove = np.array(xtrain_glove)
xtest_glove = np.array(xtest_glove)
#multi-class log-loss
def multiclass_logloss(actual, predicted, eps=1e-15):
    """Multi-class version of the logarithmic loss metric.
    :param actual: Array containing the actual target classes
    :param predicted: Matrix with class predictions, one probability per class
    """
    # Convert 'actual' to a one-hot binary array if it isn't already
    if len(actual.shape) == 1:
        actual2 = np.zeros((actual.shape[0], predicted.shape[1]))
        for i, val in enumerate(actual):
            actual2[i, val] = 1
        actual = actual2
    clip = np.clip(predicted, eps, 1 - eps)
    rows = actual.shape[0]
    vsota = np.sum(actual * np.log(clip))
    return -1.0 / rows * vsota
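A quick sanity check of the metric on a toy example (self-contained copy of the same formula): confident correct predictions give a loss near 0, while a uniform guess over 4 classes gives -log(1/4) ≈ 1.386:

```python
import numpy as np

def multiclass_logloss(actual, predicted, eps=1e-15):
    # One-hot encode integer labels if needed, clip probabilities, average -log p
    if len(actual.shape) == 1:
        one_hot = np.zeros((actual.shape[0], predicted.shape[1]))
        one_hot[np.arange(actual.shape[0]), actual] = 1
        actual = one_hot
    clip = np.clip(predicted, eps, 1 - eps)
    return -np.sum(actual * np.log(clip)) / actual.shape[0]

y_true = np.array([0, 2, 1])

perfect = np.array([[1, 0, 0, 0],
                    [0, 0, 1, 0],
                    [0, 1, 0, 0]], dtype=float)
uniform = np.full((3, 4), 0.25)

print(round(multiclass_logloss(y_true, perfect), 3))   # ~0.0
print(round(multiclass_logloss(y_true, uniform), 3))   # 1.386
```

The clipping with `eps` is what keeps log(0) from blowing up when a model assigns zero probability to the true class.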
# Logistic Regression model
def logistic_regression(X_train, y_train, X_test, y_test):
    cv = LeaveOneOut()
    lr = LogisticRegression(solver='liblinear', C=10, penalty='l2')
    lr.fit(X_train, y_train)
    # predict on the test set
    y_predict = lr.predict(X_test)
    y_prob = lr.predict_proba(X_test)
    print(classification_report(y_test, y_predict))
    print("logloss: %0.3f " % multiclass_logloss(y_test, y_prob))
    # use LOOCV to evaluate the model
    scores = cross_val_score(lr, X_train, y_train, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
    # view mean absolute error
    MAE = mean(absolute(scores))
    RMSE = sqrt(mean(absolute(scores)))
    print("Mean Absolute Error: ", MAE)
    print("Root Mean Squared Error: ", RMSE)
    conf_matrix = confusion_matrix(y_test, y_predict)
    ax = sns.heatmap(conf_matrix, annot=True, cmap='Blues')
    print('Confusion Matrix:', ax)
# Singular-Value Decomposition (SVD) is a matrix decomposition method for reducing a matrix
# to its constituent parts in order to make certain subsequent matrix calculations simpler.
# Since SVMs take a lot of time, we reduce the number of TF-IDF features with
# Truncated SVD before applying the SVM.
def svm_classifier(X_train, y_train, X_test, y_test):
    svd = decomposition.TruncatedSVD(n_components=120)
    svd.fit(X_train)
    xtrain_svd = svd.transform(X_train)
    xtest_svd = svd.transform(X_test)
    # Scale the data obtained from SVD
    scl = preprocessing.StandardScaler(with_mean=False)
    scl.fit(xtrain_svd)
    xtrain_svd_scl = scl.transform(xtrain_svd)
    xtest_svd_scl = scl.transform(xtest_svd)
    # Fit a simple SVM (probability=True since we need class probabilities)
    clf = SVC(C=1.0, probability=True)
    clf.fit(xtrain_svd_scl, y_train)
    y_predict = clf.predict(xtest_svd_scl)
    y_prob = clf.predict_proba(xtest_svd_scl)
    print(classification_report(y_test, y_predict))
    print("logloss: %0.3f " % multiclass_logloss(y_test, y_prob))
    # use LOOCV to evaluate the model
    cv = LeaveOneOut()
    scores = cross_val_score(clf, xtrain_svd_scl, y_train, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
    # view mean absolute error
    MAE = mean(absolute(scores))
    RMSE = sqrt(mean(absolute(scores)))
    print("Mean Absolute Error: ", MAE)
    print("Root Mean Squared Error: ", RMSE)
    conf_matrix = confusion_matrix(y_test, y_predict)
    ax = sns.heatmap(conf_matrix, annot=True, cmap='Blues')
    print('Confusion Matrix:', ax)
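The dimensionality reduction used above can be sketched with plain numpy on a toy matrix (invented values; `TruncatedSVD` works the same way on sparse TF-IDF matrices without densifying them):

```python
import numpy as np

# Toy "TF-IDF" matrix: 4 documents x 6 terms, invented for illustration
X = np.array([[2., 0., 1., 0., 0., 1.],
              [1., 1., 0., 0., 1., 0.],
              [0., 0., 0., 3., 1., 1.],
              [0., 1., 0., 2., 0., 1.]])

# Full SVD, then keep only the top-k singular values/vectors
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
X_reduced = U[:, :k] * s[:k]  # each document now lives in a k-dimensional latent space

print(X_reduced.shape)        # (4, 2): same rows, far fewer columns
```

Dropping the small singular values keeps most of the variance while shrinking the feature space, which is exactly why the SVM above trains much faster on the reduced matrix.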
# .tocsc() converts the sparse matrix to Compressed Sparse Column format;
# duplicate entries are summed together.
def xgb_classifier(X_train, y_train, X_test, y_test):
    clf = xgb.XGBClassifier(max_depth=7, n_estimators=200, colsample_bytree=0.8,
                            subsample=0.8, nthread=10, learning_rate=0.1)
    clf.fit(X_train.tocsc(), y_train)
    y_predict = clf.predict(X_test.tocsc())
    y_prob = clf.predict_proba(X_test.tocsc())
    print(classification_report(y_test, y_predict))
    print("logloss: %0.3f " % multiclass_logloss(y_test, y_prob))
    # use LOOCV to evaluate the model
    cv = LeaveOneOut()
    scores = cross_val_score(clf, X_train.tocsc(), y_train, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
    # view mean absolute error
    MAE = mean(absolute(scores))
    RMSE = sqrt(mean(absolute(scores)))
    print("Mean Absolute Error: ", MAE)
    print("Root Mean Squared Error: ", RMSE)
    conf_matrix = confusion_matrix(y_test, y_predict)
    ax = sns.heatmap(conf_matrix, annot=True, cmap='Blues')
    print('Confusion Matrix:', ax)
def naive_bayes(X_train, y_train, X_test, y_test):
    # Fit a simple Naive Bayes
    clf = MultinomialNB()
    clf.fit(X_train, y_train)
    y_predict = clf.predict(X_test)
    y_prob = clf.predict_proba(X_test)
    print(classification_report(y_test, y_predict))
    print("logloss: %0.3f " % multiclass_logloss(y_test, y_prob))
    # use LOOCV to evaluate the model
    cv = LeaveOneOut()
    scores = cross_val_score(clf, X_train, y_train, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
    # view mean absolute error
    MAE = mean(absolute(scores))
    RMSE = sqrt(mean(absolute(scores)))
    print("Mean Absolute Error: ", MAE)
    print("Root Mean Squared Error: ", RMSE)
    conf_matrix = confusion_matrix(y_test, y_predict)
    ax = sns.heatmap(conf_matrix, annot=True, cmap='Blues')
    print('Confusion Matrix:', ax)
mll_scorer = metrics.make_scorer(multiclass_logloss, greater_is_better=False, needs_proba=True)

def naive_grid(X_train, y_train, X_test, y_test):
    nb_model = MultinomialNB()
    # Create the pipeline
    clf = pipeline.Pipeline([('nb', nb_model)])
    # Parameter grid
    param_grid = {'nb__alpha': [0.001, 0.01, 0.1, 1, 10, 100]}
    # Initialize the grid-search model
    model = GridSearchCV(estimator=clf, param_grid=param_grid, scoring=mll_scorer,
                         verbose=10, n_jobs=-1, refit=True, cv=2)
    # Fit the grid-search model (we use only the training split here)
    model.fit(X_train, y_train)
    print("Best score: %0.3f" % model.best_score_)
    print("Best parameters set:")
    best_parameters = model.best_estimator_.get_params()
    for param_name in sorted(param_grid.keys()):
        print("\t%s: %r" % (param_name, best_parameters[param_name]))
    y_predict = model.predict(X_test)
    y_prob = model.predict_proba(X_test)
    print(classification_report(y_test, y_predict))
    # use LOOCV to evaluate the model
    cv = LeaveOneOut()
    scores = cross_val_score(model, X_train, y_train, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
    # view mean absolute error
    MAE = mean(absolute(scores))
    RMSE = sqrt(mean(absolute(scores)))
    print("Mean Absolute Error: ", MAE)
    print("Root Mean Squared Error: ", RMSE)
    conf_matrix = confusion_matrix(y_test, y_predict)
    ax = sns.heatmap(conf_matrix, annot=True, cmap='Blues')
    print('Confusion Matrix:', ax)
# DecisionTree Classifier
def decisiontree_classifier(X_train, y_train, X_test, y_test):
    clf = DecisionTreeClassifier()
    clf.fit(X_train, y_train)
    y_predict = clf.predict(X_test)
    y_prob = clf.predict_proba(X_test)
    print(classification_report(y_test, y_predict))
    print("logloss: %0.3f " % multiclass_logloss(y_test, y_prob))
    # use LOOCV to evaluate the model
    cv = LeaveOneOut()
    scores = cross_val_score(clf, X_train, y_train, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
    # view mean absolute error
    MAE = mean(absolute(scores))
    RMSE = sqrt(mean(absolute(scores)))
    print("Mean Absolute Error: ", MAE)
    print("Root Mean Squared Error: ", RMSE)
    conf_matrix = confusion_matrix(y_test, y_predict)
    ax = sns.heatmap(conf_matrix, annot=True, cmap='Blues')
    print('Confusion Matrix:', ax)
# RandomForest Classifier
def randomforest_classifier(X_train, y_train, X_test, y_test):
    clf = RandomForestClassifier()
    clf.fit(X_train, y_train)
    y_predict = clf.predict(X_test)
    y_prob = clf.predict_proba(X_test)
    print(classification_report(y_test, y_predict))
    print("logloss: %0.3f " % multiclass_logloss(y_test, y_prob))
    # use LOOCV to evaluate the model
    cv = LeaveOneOut()
    scores = cross_val_score(clf, X_train, y_train, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
    # view mean absolute error
    MAE = mean(absolute(scores))
    RMSE = sqrt(mean(absolute(scores)))
    print("Mean Absolute Error: ", MAE)
    print("Root Mean Squared Error: ", RMSE)
    conf_matrix = confusion_matrix(y_test, y_predict)
    ax = sns.heatmap(conf_matrix, annot=True, cmap='Blues')
    print('Confusion Matrix:', ax)
# KNeighbors Classifier
def kneighbors_classifier(X_train, y_train, X_test, y_test):
    clf = KNeighborsClassifier()
    clf.fit(X_train, y_train)
    y_predict = clf.predict(X_test)
    y_prob = clf.predict_proba(X_test)
    print(classification_report(y_test, y_predict))
    print("logloss: %0.3f " % multiclass_logloss(y_test, y_prob))
    # use LOOCV to evaluate the model
    cv = LeaveOneOut()
    scores = cross_val_score(clf, X_train, y_train, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
    # view mean absolute error
    MAE = mean(absolute(scores))
    RMSE = sqrt(mean(absolute(scores)))
    print("Mean Absolute Error: ", MAE)
    print("Root Mean Squared Error: ", RMSE)
    conf_matrix = confusion_matrix(y_test, y_predict)
    ax = sns.heatmap(conf_matrix, annot=True, cmap='Blues')
    print('Confusion Matrix:', ax)
naive_grid(X_train_counts, y_train, X_test_counts , y_test)
Fitting 2 folds for each of 6 candidates, totalling 12 fits
Best score: -1.919
Best parameters set:
nb__alpha: 1
precision recall f1-score support
0 1.00 1.00 1.00 4
1 1.00 1.00 1.00 6
2 1.00 1.00 1.00 4
3 1.00 1.00 1.00 10
accuracy 1.00 24
macro avg 1.00 1.00 1.00 24
weighted avg 1.00 1.00 1.00 24
Mean Absolute Error: 0.1111111111111111
Root Mean Squared Error: 0.3333333333333333
Confusion Matrix: AxesSubplot(0.125,0.125;0.62x0.755)
naive_bayes(X_train_counts, y_train, X_test_counts , y_test)
precision recall f1-score support
0 1.00 1.00 1.00 4
1 1.00 1.00 1.00 6
2 1.00 1.00 1.00 4
3 1.00 1.00 1.00 10
accuracy 1.00 24
macro avg 1.00 1.00 1.00 24
weighted avg 1.00 1.00 1.00 24
logloss: 0.000
Mean Absolute Error: 0.14814814814814814
Root Mean Squared Error: 0.3849001794597505
Confusion Matrix: AxesSubplot(0.125,0.125;0.62x0.755)
xgb_classifier(X_train_counts, y_train, X_test_counts , y_test)
precision recall f1-score support
0 1.00 1.00 1.00 4
1 1.00 1.00 1.00 6
2 1.00 1.00 1.00 4
3 1.00 1.00 1.00 10
accuracy 1.00 24
macro avg 1.00 1.00 1.00 24
weighted avg 1.00 1.00 1.00 24
logloss: 0.112
Mean Absolute Error: 0.018518518518518517
Root Mean Squared Error: 0.13608276348795434
Confusion Matrix: AxesSubplot(0.125,0.125;0.62x0.755)
svm_classifier(X_train_counts, y_train, X_test_counts , y_test)
precision recall f1-score support
0 1.00 1.00 1.00 4
1 1.00 0.33 0.50 6
2 0.50 1.00 0.67 4
3 1.00 1.00 1.00 10
accuracy 0.83 24
macro avg 0.88 0.83 0.79 24
weighted avg 0.92 0.83 0.82 24
logloss: 3.111
Mean Absolute Error: 1.6296296296296295
Root Mean Squared Error: 1.2765694770084508
Confusion Matrix: AxesSubplot(0.125,0.125;0.62x0.755)
logistic_regression(X_train_counts, y_train, X_test_counts , y_test)
precision recall f1-score support
0 1.00 1.00 1.00 4
1 1.00 1.00 1.00 6
2 1.00 1.00 1.00 4
3 1.00 1.00 1.00 10
accuracy 1.00 24
macro avg 1.00 1.00 1.00 24
weighted avg 1.00 1.00 1.00 24
logloss: 0.007
Mean Absolute Error: 0.09259259259259259
Root Mean Squared Error: 0.3042903097250923
Confusion Matrix: AxesSubplot(0.125,0.125;0.62x0.755)
decisiontree_classifier(X_train_counts, y_train, X_test_counts , y_test)
precision recall f1-score support
0 1.00 1.00 1.00 4
1 1.00 0.83 0.91 6
2 1.00 1.00 1.00 4
3 0.91 1.00 0.95 10
accuracy 0.96 24
macro avg 0.98 0.96 0.97 24
weighted avg 0.96 0.96 0.96 24
logloss: 1.439
Mean Absolute Error: 0.12962962962962962
Root Mean Squared Error: 0.3600411499115478
Confusion Matrix: AxesSubplot(0.125,0.125;0.62x0.755)
randomforest_classifier(X_train_counts, y_train, X_test_counts , y_test)
precision recall f1-score support
0 1.00 1.00 1.00 4
1 1.00 0.83 0.91 6
2 1.00 1.00 1.00 4
3 0.91 1.00 0.95 10
accuracy 0.96 24
macro avg 0.98 0.96 0.97 24
weighted avg 0.96 0.96 0.96 24
logloss: 0.430
Mean Absolute Error: 0.09259259259259259
Root Mean Squared Error: 0.3042903097250923
Confusion Matrix: AxesSubplot(0.125,0.125;0.62x0.755)
kneighbors_classifier(X_train_counts, y_train, X_test_counts , y_test)
precision recall f1-score support
0 1.00 1.00 1.00 4
1 1.00 0.83 0.91 6
2 1.00 1.00 1.00 4
3 0.91 1.00 0.95 10
accuracy 0.96 24
macro avg 0.98 0.96 0.97 24
weighted avg 0.96 0.96 0.96 24
logloss: 0.291
Mean Absolute Error: 0.42592592592592593
Root Mean Squared Error: 0.6526300069150406
Confusion Matrix: AxesSubplot(0.125,0.125;0.62x0.755)
naive_grid(X_train_vectors_tfidf, y_train, X_test_vectors_tfidf , y_test)
Fitting 2 folds for each of 6 candidates, totalling 12 fits
Best score: -0.261
Best parameters set:
nb__alpha: 0.01
precision recall f1-score support
0 1.00 1.00 1.00 4
1 1.00 1.00 1.00 6
2 1.00 1.00 1.00 4
3 1.00 1.00 1.00 10
accuracy 1.00 24
macro avg 1.00 1.00 1.00 24
weighted avg 1.00 1.00 1.00 24
Mean Absolute Error: 0.14814814814814814
Root Mean Squared Error: 0.3849001794597505
Confusion Matrix: AxesSubplot(0.125,0.125;0.62x0.755)
naive_bayes(X_train_vectors_tfidf, y_train, X_test_vectors_tfidf , y_test)
precision recall f1-score support
0 0.80 1.00 0.89 4
1 1.00 0.83 0.91 6
2 1.00 1.00 1.00 4
3 1.00 1.00 1.00 10
accuracy 0.96 24
macro avg 0.95 0.96 0.95 24
weighted avg 0.97 0.96 0.96 24
logloss: 0.474
Mean Absolute Error: 0.14814814814814814
Root Mean Squared Error: 0.3849001794597505
Confusion Matrix: AxesSubplot(0.125,0.125;0.62x0.755)
xgb_classifier(X_train_vectors_tfidf, y_train, X_test_vectors_tfidf , y_test)
precision recall f1-score support
0 1.00 1.00 1.00 4
1 1.00 0.83 0.91 6
2 1.00 1.00 1.00 4
3 0.91 1.00 0.95 10
accuracy 0.96 24
macro avg 0.98 0.96 0.97 24
weighted avg 0.96 0.96 0.96 24
logloss: 0.129
Mean Absolute Error: 0.05555555555555555
Root Mean Squared Error: 0.23570226039551584
Confusion Matrix: AxesSubplot(0.125,0.125;0.62x0.755)
svm_classifier(X_train_vectors_tfidf, y_train, X_test_vectors_tfidf , y_test)
precision recall f1-score support
0 0.33 1.00 0.50 4
1 0.00 0.00 0.00 6
2 1.00 1.00 1.00 4
3 1.00 0.80 0.89 10
accuracy 0.67 24
macro avg 0.58 0.70 0.60 24
weighted avg 0.64 0.67 0.62 24
logloss: 2.023
Mean Absolute Error: 2.074074074074074
Root Mean Squared Error: 1.4401645996461911
Confusion Matrix: AxesSubplot(0.125,0.125;0.62x0.755)
logistic_regression(X_train_vectors_tfidf, y_train, X_test_vectors_tfidf , y_test)
precision recall f1-score support
0 1.00 1.00 1.00 4
1 1.00 1.00 1.00 6
2 1.00 1.00 1.00 4
3 1.00 1.00 1.00 10
accuracy 1.00 24
macro avg 1.00 1.00 1.00 24
weighted avg 1.00 1.00 1.00 24
logloss: 0.263
Mean Absolute Error: 0.12962962962962962
Root Mean Squared Error: 0.3600411499115478
Confusion Matrix: AxesSubplot(0.125,0.125;0.62x0.755)
decisiontree_classifier(X_train_vectors_tfidf, y_train, X_test_vectors_tfidf , y_test)
precision recall f1-score support
0 1.00 1.00 1.00 4
1 1.00 0.67 0.80 6
2 1.00 1.00 1.00 4
3 0.83 1.00 0.91 10
accuracy 0.92 24
macro avg 0.96 0.92 0.93 24
weighted avg 0.93 0.92 0.91 24
logloss: 2.878
Mean Absolute Error: 0.14814814814814814
Root Mean Squared Error: 0.3849001794597505
Confusion Matrix: AxesSubplot(0.125,0.125;0.62x0.755)
randomforest_classifier(X_train_vectors_tfidf, y_train, X_test_vectors_tfidf , y_test)
precision recall f1-score support
0 1.00 1.00 1.00 4
1 1.00 1.00 1.00 6
2 1.00 1.00 1.00 4
3 1.00 1.00 1.00 10
accuracy 1.00 24
macro avg 1.00 1.00 1.00 24
weighted avg 1.00 1.00 1.00 24
logloss: 0.364
Mean Absolute Error: 0.1111111111111111
Root Mean Squared Error: 0.3333333333333333
Confusion Matrix: AxesSubplot(0.125,0.125;0.62x0.755)
kneighbors_classifier(X_train_vectors_tfidf, y_train, X_test_vectors_tfidf , y_test)
precision recall f1-score support
0 1.00 1.00 1.00 4
1 1.00 1.00 1.00 6
2 1.00 1.00 1.00 4
3 1.00 1.00 1.00 10
accuracy 1.00 24
macro avg 1.00 1.00 1.00 24
weighted avg 1.00 1.00 1.00 24
logloss: 0.046
Mean Absolute Error: 0.12962962962962962
Root Mean Squared Error: 0.3600411499115478
Confusion Matrix: AxesSubplot(0.125,0.125;0.62x0.755)
# Support Vector Machine Classifier
clf = SVC(C=1.0, probability=True)  # probability=True because we need class probabilities for log loss
clf.fit(X_train_vectors_w2v, y_train)
y_predict = clf.predict(X_test_vectors_w2v)
y_prob = clf.predict_proba(X_test_vectors_w2v)
print(classification_report(y_test, y_predict))
print("logloss: %0.3f " % multiclass_logloss(y_test, y_prob))
# use leave-one-out cross-validation (LOOCV) to evaluate the model
cv = LeaveOneOut()
scores = cross_val_score(clf, X_train_vectors_w2v, y_train, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
# mean absolute error across the LOOCV folds (sklearn returns negated scores)
MAE = mean(absolute(scores))
# caution: sqrt(MAE) equals the true RMSE only when every per-fold error is 0 or 1
RMSE = sqrt(mean(absolute(scores)))
print("Mean Absolute Error: ", MAE)
print("Root Mean Squared Error: ", RMSE)
conf_matrix = confusion_matrix(y_test, y_predict)
ax = sns.heatmap(conf_matrix, annot=True, cmap='Blues')
print('Confusion Matrix:', ax)
precision recall f1-score support
0 0.21 1.00 0.35 4
1 0.00 0.00 0.00 6
2 0.80 1.00 0.89 4
3 0.00 0.00 0.00 10
accuracy 0.33 24
macro avg 0.25 0.50 0.31 24
weighted avg 0.17 0.33 0.21 24
logloss: 1.456
Mean Absolute Error: 2.074074074074074
Root Mean Squared Error: 1.4401645996461911
Confusion Matrix: AxesSubplot(0.125,0.125;0.62x0.755)
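The `multiclass_logloss` helper used in these cells is defined earlier in the notebook; for reference, a minimal sketch of the metric it computes (the mean negative log of the probability assigned to each sample's true class) could look like this. The function name and signature here are illustrative, not the notebook's exact definition:

```python
import numpy as np

def multiclass_logloss_sketch(actual, predicted, eps=1e-15):
    """Mean negative log of the probability assigned to each sample's true class.

    actual:    1-D array of integer class labels, shape (n_samples,)
    predicted: array of class probabilities, shape (n_samples, n_classes)
    """
    predicted = np.clip(predicted, eps, 1 - eps)  # avoid log(0)
    rows = np.arange(len(actual))
    true_class_probs = predicted[rows, actual]    # probability of each true class
    return -np.mean(np.log(true_class_probs))
```

Confident, correct predictions drive the loss toward 0, while confident wrong predictions are penalized heavily, which is why log loss separates models that tie on accuracy.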
# Fitting a simple xgboost on word2vec features
clf = xgb.XGBClassifier(nthread=10, learning_rate=0.1, silent=False)
clf.fit(X_train_vectors_w2v, y_train)
#predict y value for dataset
y_predict= clf.predict(X_test_vectors_w2v)
y_prob= clf.predict_proba(X_test_vectors_w2v)
print(classification_report(y_test,y_predict))
print ("logloss: %0.3f " % multiclass_logloss(y_test, y_prob))
#use LOOCV to evaluate model
cv = LeaveOneOut()
scores= cross_val_score(clf, X_train_vectors_w2v, y_train, scoring='neg_mean_absolute_error',cv=cv, n_jobs=-1)
#view mean absolute error
MAE=mean(absolute(scores))
RMSE=sqrt(mean(absolute(scores)))
print ("Mean Absolute Error: ", MAE)
print ("Root Mean Squared Error: ", RMSE)
conf_matrix= confusion_matrix(y_test, y_predict)
ax= sns.heatmap(conf_matrix, annot=True, cmap='Blues')
print('Confusion Matrix:', ax)
[03:43:32] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.5.1/src/learner.cc:576:
Parameters: { "silent" } might not be used.
This could be a false alarm, with some parameters getting used by language bindings but
then being mistakenly passed down to XGBoost core, or some parameter actually being used
but getting flagged wrongly here. Please open an issue if you find any such cases.
[03:43:32] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.5.1/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'multi:softprob' was changed from 'merror' to 'mlogloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
precision recall f1-score support
0 1.00 1.00 1.00 4
1 1.00 0.67 0.80 6
2 1.00 1.00 1.00 4
3 0.83 1.00 0.91 10
accuracy 0.92 24
macro avg 0.96 0.92 0.93 24
weighted avg 0.93 0.92 0.91 24
logloss: 0.337
Mean Absolute Error: 0.37037037037037035
Root Mean Squared Error: 0.6085806194501846
Confusion Matrix: AxesSubplot(0.125,0.125;0.62x0.755)
logistic_regression(X_train_vectors_w2v, y_train, X_test_vectors_w2v , y_test)
precision recall f1-score support
0 0.44 1.00 0.62 4
1 0.00 0.00 0.00 6
2 0.57 1.00 0.73 4
3 0.88 0.70 0.78 10
accuracy 0.62 24
macro avg 0.47 0.68 0.53 24
weighted avg 0.53 0.62 0.55 24
logloss: 1.205
Mean Absolute Error: 0.5370370370370371
Root Mean Squared Error: 0.73282810879294
Confusion Matrix: AxesSubplot(0.125,0.125;0.62x0.755)
decisiontree_classifier(X_train_vectors_w2v, y_train, X_test_vectors_w2v , y_test)
precision recall f1-score support
0 0.80 1.00 0.89 4
1 0.80 0.67 0.73 6
2 1.00 0.75 0.86 4
3 0.82 0.90 0.86 10
accuracy 0.83 24
macro avg 0.85 0.83 0.83 24
weighted avg 0.84 0.83 0.83 24
logloss: 5.756
Mean Absolute Error: 0.6111111111111112
Root Mean Squared Error: 0.7817359599705717
Confusion Matrix: AxesSubplot(0.125,0.125;0.62x0.755)
randomforest_classifier(X_train_vectors_w2v, y_train, X_test_vectors_w2v , y_test)
precision recall f1-score support
0 0.67 1.00 0.80 4
1 1.00 0.17 0.29 6
2 1.00 1.00 1.00 4
3 0.77 1.00 0.87 10
accuracy 0.79 24
macro avg 0.86 0.79 0.74 24
weighted avg 0.85 0.79 0.73 24
logloss: 0.452
Mean Absolute Error: 0.24074074074074073
Root Mean Squared Error: 0.49065338146265813
Confusion Matrix: AxesSubplot(0.125,0.125;0.62x0.755)
kneighbors_classifier(X_train_vectors_w2v, y_train, X_test_vectors_w2v , y_test)
precision recall f1-score support
0 0.30 0.75 0.43 4
1 0.50 0.17 0.25 6
2 0.80 1.00 0.89 4
3 0.71 0.50 0.59 10
accuracy 0.54 24
macro avg 0.58 0.60 0.54 24
weighted avg 0.61 0.54 0.53 24
logloss: 3.501
Mean Absolute Error: 0.7222222222222222
Root Mean Squared Error: 0.8498365855987975
Confusion Matrix: AxesSubplot(0.125,0.125;0.62x0.755)
# Fitting a simple xgboost on glove features
clf = xgb.XGBClassifier(nthread=10, learning_rate=0.1, silent=False)
clf.fit(xtrain_glove, y_train)
#predict y value for dataset
y_predict= clf.predict(xtest_glove)
y_prob= clf.predict_proba(xtest_glove)
print(classification_report(y_test,y_predict))
print ("logloss: %0.3f " % multiclass_logloss(y_test, y_prob))
#use LOOCV to evaluate model
cv = LeaveOneOut()
scores= cross_val_score(clf, xtrain_glove, y_train, scoring='neg_mean_absolute_error',cv=cv, n_jobs=-1)
#view mean absolute error
MAE=mean(absolute(scores))
RMSE=sqrt(mean(absolute(scores)))
print ("Mean Absolute Error: ", MAE)
print ("Root Mean Squared Error: ", RMSE)
conf_matrix= confusion_matrix(y_test, y_predict)
ax= sns.heatmap(conf_matrix, annot=True, cmap='Blues')
print('Confusion Matrix:', ax)
precision recall f1-score support
0 0.80 1.00 0.89 4
1 1.00 0.67 0.80 6
2 0.80 1.00 0.89 4
3 0.90 0.90 0.90 10
accuracy 0.88 24
macro avg 0.88 0.89 0.87 24
weighted avg 0.89 0.88 0.87 24
logloss: 0.364
Mean Absolute Error: 0.16666666666666666
Root Mean Squared Error: 0.408248290463863
Confusion Matrix: AxesSubplot(0.125,0.125;0.62x0.755)
svm_classifier(xtrain_glove , y_train,xtest_glove , y_test)
precision recall f1-score support
0 0.31 1.00 0.47 4
1 0.00 0.00 0.00 6
2 0.36 1.00 0.53 4
3 0.00 0.00 0.00 10
accuracy 0.33 24
macro avg 0.17 0.50 0.25 24
weighted avg 0.11 0.33 0.17 24
logloss: 2.612
Mean Absolute Error: 1.7037037037037037
Root Mean Squared Error: 1.3052600138300812
Confusion Matrix: AxesSubplot(0.125,0.125;0.62x0.755)
logistic_regression(xtrain_glove , y_train,xtest_glove , y_test)
precision recall f1-score support
0 1.00 1.00 1.00 4
1 1.00 0.83 0.91 6
2 0.80 1.00 0.89 4
3 1.00 1.00 1.00 10
accuracy 0.96 24
macro avg 0.95 0.96 0.95 24
weighted avg 0.97 0.96 0.96 24
logloss: 0.734
Mean Absolute Error: 0.2037037037037037
Root Mean Squared Error: 0.45133546692422
Confusion Matrix: AxesSubplot(0.125,0.125;0.62x0.755)
decisiontree_classifier(xtrain_glove , y_train,xtest_glove , y_test)
precision recall f1-score support
0 0.40 1.00 0.57 4
1 0.00 0.00 0.00 6
2 0.80 1.00 0.89 4
3 0.89 0.80 0.84 10
accuracy 0.67 24
macro avg 0.52 0.70 0.58 24
weighted avg 0.57 0.67 0.59 24
logloss: 11.513
Mean Absolute Error: 0.5370370370370371
Root Mean Squared Error: 0.73282810879294
Confusion Matrix: AxesSubplot(0.125,0.125;0.62x0.755)
randomforest_classifier(xtrain_glove , y_train,xtest_glove , y_test)
precision recall f1-score support
0 0.67 1.00 0.80 4
1 1.00 0.67 0.80 6
2 1.00 1.00 1.00 4
3 1.00 1.00 1.00 10
accuracy 0.92 24
macro avg 0.92 0.92 0.90 24
weighted avg 0.94 0.92 0.92 24
logloss: 0.498
Mean Absolute Error: 0.07407407407407407
Root Mean Squared Error: 0.2721655269759087
Confusion Matrix: AxesSubplot(0.125,0.125;0.62x0.755)
kneighbors_classifier(xtrain_glove , y_train,xtest_glove , y_test)
precision recall f1-score support
0 1.00 1.00 1.00 4
1 0.71 0.83 0.77 6
2 0.80 1.00 0.89 4
3 1.00 0.80 0.89 10
accuracy 0.88 24
macro avg 0.88 0.91 0.89 24
weighted avg 0.90 0.88 0.88 24
logloss: 0.164
Mean Absolute Error: 0.2037037037037037
Root Mean Squared Error: 0.45133546692422
Confusion Matrix: AxesSubplot(0.125,0.125;0.62x0.755)
# scale the data before any neural net:
scl = preprocessing.StandardScaler()
xtrain_glove_scl = scl.fit_transform(xtrain_glove)
xvalid_glove_scl = scl.transform(xtest_glove)
# one-hot encode the labels for the neural net
# (num_classes=5 leaves one output column unused, since only classes 0-3 appear)
ytrain_enc = np_utils.to_categorical(y_train, num_classes=5)
yvalid_enc = np_utils.to_categorical(y_test, num_classes=5)
# create a simple 3 layer sequential neural net
model = Sequential()
model.add(Dense(300, input_dim=300, activation='relu'))
model.add(Dropout(0.2))
model.add(BatchNormalization())
model.add(Dense(1024, activation='relu'))
model.add(Dropout(0.3))
model.add(BatchNormalization())
model.add(Dense(5))
model.add(Activation('softmax'))
# compile the model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
history=model.fit(xtrain_glove_scl, y=ytrain_enc, batch_size=15, epochs=50, verbose=1,validation_data=(xvalid_glove_scl, yvalid_enc))
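The `np_utils.to_categorical` call above one-hot encodes the integer labels for the 5-way softmax output. A plain-Python sketch of the encoding it produces:

```python
def to_categorical_sketch(labels, num_classes):
    # One row per label, with a 1 in the column of that label's class index.
    return [[1 if col == label else 0 for col in range(num_classes)]
            for label in labels]
```

With `num_classes=5` and labels drawn from 0-3, the fifth column is always zero, so the corresponding softmax output simply learns to stay near zero.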
Epoch 1/50 4/4 [==============================] - 8s 329ms/step - loss: 1.2939 - accuracy: 0.4815 - val_loss: 0.6150 - val_accuracy: 0.9583
...
Epoch 50/50 4/4 [==============================] - 0s 20ms/step - loss: 2.8810e-04 - accuracy: 1.0000 - val_loss: 0.0550 - val_accuracy: 0.9583
(intermediate epoch logs omitted; training accuracy reaches 1.0000 within a few epochs while validation accuracy fluctuates between 0.8750 and 1.0000)
# evaluate the keras model
scores = model.evaluate(xvalid_glove_scl, yvalid_enc)
print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))
1/1 [==============================] - 0s 34ms/step - loss: 0.0550 - accuracy: 0.9583 accuracy: 95.83%
# list all data in history
print(history.history.keys())
# summarize history for accuracy
import matplotlib.pyplot as plt
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
# summarize history for loss
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
dict_keys(['loss', 'accuracy', 'val_loss', 'val_accuracy'])
# using keras tokenizer here
token = text.Tokenizer(num_words=None)
max_len = 70
token.fit_on_texts(list(X_train) + list(X_test))
xtrain_seq = token.texts_to_sequences(X_train)
xvalid_seq = token.texts_to_sequences(X_test)
# zero pad the sequences
xtrain_pad = sequence.pad_sequences(xtrain_seq, maxlen=max_len)
xvalid_pad = sequence.pad_sequences(xvalid_seq, maxlen=max_len)
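Keras's `pad_sequences` defaults to *pre*-padding and *pre*-truncation: sequences shorter than `maxlen` get zeros prepended, and longer ones keep only their last `maxlen` tokens. A plain-Python sketch of that default behavior:

```python
def pad_sequences_sketch(sequences, maxlen, value=0):
    padded = []
    for seq in sequences:
        seq = seq[-maxlen:]  # pre-truncation: keep the last `maxlen` tokens
        padded.append([value] * (maxlen - len(seq)) + seq)  # pre-padding with zeros
    return padded
```

This is why index 0 is reserved by the tokenizer: it acts purely as the padding token and gets a zero row in the embedding matrix below.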
word_index = token.word_index
# create an embedding matrix for the words we have in the dataset
embedding_matrix = np.zeros((len(word_index) + 1, 300))
for word, i in tqdm(word_index.items()):
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
100%|██████████████████████████████████████████████████████████████████████████| 3846/3846 [00:00<00:00, 171896.61it/s]
# A simple LSTM with glove embeddings and two dense layers
model = Sequential()
model.add(Embedding(len(word_index) + 1,
                    300,
                    weights=[embedding_matrix],
                    input_length=max_len,
                    trainable=False))
model.add(SpatialDropout1D(0.3))
model.add(LSTM(80, dropout=0.3, recurrent_dropout=0.3))
model.add(Dense(160, activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(160, activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(5))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam',metrics=['accuracy'])
model.summary()
Model: "sequential_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding (Embedding) (None, 70, 300) 1154100
spatial_dropout1d (SpatialD (None, 70, 300) 0
ropout1D)
lstm (LSTM) (None, 80) 121920
dense_3 (Dense) (None, 160) 12960
dropout_2 (Dropout) (None, 160) 0
dense_4 (Dense) (None, 160) 25760
dropout_3 (Dropout) (None, 160) 0
dense_5 (Dense) (None, 5) 805
activation_1 (Activation) (None, 5) 0
=================================================================
Total params: 1,315,545
Trainable params: 161,445
Non-trainable params: 1,154,100
_________________________________________________________________
earlystop = EarlyStopping(monitor='val_loss', min_delta=0, patience=3, verbose=0, mode='auto')
history= model.fit(xtrain_pad, y=ytrain_enc, batch_size=15, epochs=100, verbose=1, validation_data=(xvalid_pad, yvalid_enc), callbacks=[earlystop])
Epoch 1/100 4/4 [==============================] - 6s 141ms/step - loss: 1.5936 - accuracy: 0.2593 - val_loss: 1.6159 - val_accuracy: 0.1250
...
Epoch 14/100 4/4 [==============================] - 0s 46ms/step - loss: 0.3580 - accuracy: 0.9259 - val_loss: 1.5710 - val_accuracy: 0.3750
(intermediate epoch logs omitted; val_loss bottomed out at 1.3781 in epoch 11, and early stopping with patience=3 halted training after epoch 14)
# evaluate the keras model
scores = model.evaluate(xvalid_pad, yvalid_enc)
print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))
1/1 [==============================] - 0s 37ms/step - loss: 1.5710 - accuracy: 0.3750 accuracy: 37.50%
# list all data in history
print(history.history.keys())
# summarize history for accuracy
import matplotlib.pyplot as plt
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
# summarize history for loss
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
dict_keys(['loss', 'accuracy', 'val_loss', 'val_accuracy'])
data= {'Type of Vectorizer':['CountVectorizer', 'CountVectorizer', 'Tf-Idf', 'CountVectorizer', 'Glove', 'Tf-Idf', 'Glove', 'Tf-Idf', 'Tf-Idf',
'CountVectorizer', 'Glove', 'Tf-Idf', 'CountVectorizer', 'Word2Vec', 'Tf-Idf', 'Glove', 'Word2Vec', 'Glove', 'Word2Vec',
'Word2Vec', 'Glove', 'CountVectorizer', 'Glove', 'Word2Vec', 'Tf-Idf', 'Tf-Idf', 'CountVectorizer', 'CountVectorizer',
'Word2Vec', 'Glove'],
'Machine Learning Model': ['Naïve Bayes Classifier', 'Logistic Regression', 'KNN Classifier', 'XGB Classifier',
'Deep Neural Network- Simple Dense Network', 'XGB Classifier', 'KNN Classifier', 'Naïve Bayes Classifier on GridSearchCV',
'Logistic Regression', 'KNN Classifier', 'XGB Classifier', 'Random Forest Classifier', 'Random Forest Classifier',
'Random Forest Classifier', 'Naïve Bayes Classifier', 'Random Forest Classifier', 'XGB Classifier',
'Logistic Regression', 'Logistic Regression', 'Support Vector Machine Classifier', 'Deep Neural Network- LSTM',
'Naïve Bayes Classifier on GridSearchCV', 'Support Vector Machine Classifier', 'KNN Classifier',
'Support Vector Machine Classifier', 'Decision Tree Classifier', 'Support Vector Machine Classifier',
'Decision Tree Classifier', 'Decision Tree Classifier', 'Decision Tree Classifier'],
'precision':[1.000, 1.000, 1.000, 1.000, 0, 0.980, 0.880, 1.000, 1.000, 0.980, 0.880, 1.000, 0.930, 0.820, 0.950, 0.950, 0.850, 0.950,
0.470, 0.250, 0, 1.000, 0.170, 0.670, 0.580, 0.780, 0.880, 0.780, 0.820, 0.660],
'recall':[1.000, 1.000, 1.000, 1.000, 0, 0.960, 0.910, 1.000, 1.000, 0.960, 0.890, 1.000, 0.920, 0.790, 0.960, 0.960, 0.890, 0.960, 0.680,
0.500, 0, 1.000, 0.500, 0.650, 0.700, 0.750, 0.830, 0.730, 0.870, 0.660],
'f1-score':[1.000, 1.000, 1.000, 1.000, 0, 0.970, 0.890, 1.000, 1.000, 0.970, 0.870, 1.000, 0.910, 0.710, 0.950, 0.950, 0.860, 0.950,
0.530, 0.310, 0, 1.000, 0.250, 0.620, 0.600, 0.760, 0.790, 0.750, 0.830, 0.660],
'MAE':[0.148, 0.093, 0.130, 0.019, 0, 0.056, 0.204, 0.148, 0.130, 0.426, 0.167, 0.074, 0.148, 0.148, 0.185, 0.222, 0.204, 0.333,
0.519, 0, 2.074, 0.111, 2.074, 1.704, 1.630, 0.167, 0.704, 0.204, 0.352, 0.370],
'RMSE':[0.385, 0.304, 0.360, 0.136, 0, 0.236, 0.451, 0.385, 0.360, 0.653, 0.408, 0.272, 0.385, 0.385, 0.430, 0.471, 0.451,
0.577, 0.720, 0, 1.440, 0.333, 1.440, 1.305, 1.277, 0.408, 0.839, 0.451, 0.593, 0.609],
'accuracy':[ 1.000, 1.000, 1.000, 1.000, 0.958, 0.960, 0.880, 1.000, 1.000, 0.960, 0.880, 0.960, 0.920, 0.960, 0.960, 0.790,
0.960, 0.750, 0.620, 0.500, 0.330, 1.000, 0.670, 0.330, 0.830, 0.960, 0.540, 0.920, 0.830, 0.710],
'log-loss':[ 0.000, 0.007, 0.046, 0.112, 0.113, 0.129, 0.164, 0.261, 0.263, 0.291, 0.364, 0.393, 0.467, 0.474, 0.482,
0.548, 0.734, 0.796, 1.203, 1.496, 1.498, 1.919, 2.376, 2.639, 2.741, 2.878, 3.505, 4.317, 5.756, 10.074]
}
df= pd.DataFrame(data)
df= df.sort_values(by=['Type of Vectorizer'], ascending = True)
df.reset_index(inplace=True, drop=True)
df
| | Type of Vectorizer | Machine Learning Model | precision | recall | f1-score | MAE | RMSE | accuracy | log-loss |
|---|---|---|---|---|---|---|---|---|---|
| 0 | CountVectorizer | Naïve Bayes Classifier | 1.00 | 1.00 | 1.00 | 0.148 | 0.385 | 1.000 | 0.000 |
| 1 | CountVectorizer | Logistic Regression | 1.00 | 1.00 | 1.00 | 0.093 | 0.304 | 1.000 | 0.007 |
| 2 | CountVectorizer | Decision Tree Classifier | 0.78 | 0.73 | 0.75 | 0.204 | 0.451 | 0.920 | 4.317 |
| 3 | CountVectorizer | XGB Classifier | 1.00 | 1.00 | 1.00 | 0.019 | 0.136 | 1.000 | 0.112 |
| 4 | CountVectorizer | Support Vector Machine Classifier | 0.88 | 0.83 | 0.79 | 0.704 | 0.839 | 0.540 | 3.505 |
| 5 | CountVectorizer | Naïve Bayes Classifier on GridSearchCV | 1.00 | 1.00 | 1.00 | 0.111 | 0.333 | 1.000 | 1.919 |
| 6 | CountVectorizer | KNN Classifier | 0.98 | 0.96 | 0.97 | 0.426 | 0.653 | 0.960 | 0.291 |
| 7 | CountVectorizer | Random Forest Classifier | 0.93 | 0.92 | 0.91 | 0.148 | 0.385 | 0.920 | 0.467 |
| 8 | Glove | Support Vector Machine Classifier | 0.17 | 0.50 | 0.25 | 2.074 | 1.440 | 0.670 | 2.376 |
| 9 | Glove | Deep Neural Network- LSTM | 0.00 | 0.00 | 0.00 | 2.074 | 1.440 | 0.330 | 1.498 |
| 10 | Glove | Logistic Regression | 0.95 | 0.96 | 0.95 | 0.333 | 0.577 | 0.750 | 0.796 |
| 11 | Glove | Random Forest Classifier | 0.95 | 0.96 | 0.95 | 0.222 | 0.471 | 0.790 | 0.548 |
| 12 | Glove | Decision Tree Classifier | 0.66 | 0.66 | 0.66 | 0.370 | 0.609 | 0.710 | 10.074 |
| 13 | Glove | KNN Classifier | 0.88 | 0.91 | 0.89 | 0.204 | 0.451 | 0.880 | 0.164 |
| 14 | Glove | XGB Classifier | 0.88 | 0.89 | 0.87 | 0.167 | 0.408 | 0.880 | 0.364 |
| 15 | Glove | Deep Neural Network- Simple Dense Network | 0.00 | 0.00 | 0.00 | 0.000 | 0.000 | 0.958 | 0.113 |
| 16 | Tf-Idf | KNN Classifier | 1.00 | 1.00 | 1.00 | 0.130 | 0.360 | 1.000 | 0.046 |
| 17 | Tf-Idf | Decision Tree Classifier | 0.78 | 0.75 | 0.76 | 0.167 | 0.408 | 0.960 | 2.878 |
| 18 | Tf-Idf | Support Vector Machine Classifier | 0.58 | 0.70 | 0.60 | 1.630 | 1.277 | 0.830 | 2.741 |
| 19 | Tf-Idf | XGB Classifier | 0.98 | 0.96 | 0.97 | 0.056 | 0.236 | 0.960 | 0.129 |
| 20 | Tf-Idf | Random Forest Classifier | 1.00 | 1.00 | 1.00 | 0.074 | 0.272 | 0.960 | 0.393 |
| 21 | Tf-Idf | Naïve Bayes Classifier | 0.95 | 0.96 | 0.95 | 0.185 | 0.430 | 0.960 | 0.482 |
| 22 | Tf-Idf | Logistic Regression | 1.00 | 1.00 | 1.00 | 0.130 | 0.360 | 1.000 | 0.263 |
| 23 | Tf-Idf | Naïve Bayes Classifier on GridSearchCV | 1.00 | 1.00 | 1.00 | 0.148 | 0.385 | 1.000 | 0.261 |
| 24 | Word2Vec | Support Vector Machine Classifier | 0.25 | 0.50 | 0.31 | 0.000 | 0.000 | 0.500 | 1.496 |
| 25 | Word2Vec | KNN Classifier | 0.67 | 0.65 | 0.62 | 1.704 | 1.305 | 0.330 | 2.639 |
| 26 | Word2Vec | Logistic Regression | 0.47 | 0.68 | 0.53 | 0.519 | 0.720 | 0.620 | 1.203 |
| 27 | Word2Vec | XGB Classifier | 0.85 | 0.89 | 0.86 | 0.204 | 0.451 | 0.960 | 0.734 |
| 28 | Word2Vec | Decision Tree Classifier | 0.82 | 0.87 | 0.83 | 0.352 | 0.593 | 0.830 | 5.756 |
| 29 | Word2Vec | Random Forest Classifier | 0.82 | 0.79 | 0.71 | 0.148 | 0.385 | 0.960 | 0.474 |
Let's first understand the evaluation metrics used in these reports:

| Metric | Description |
|---|---|
| 1. Precision | True Positives (TP) divided by the total of True Positives and False Positives (FP). Of all the positive predictions, how many are truly positive? |
| 2. Recall | True Positives (TP) divided by the total of True Positives and False Negatives (FN). Layman definition: of all the actual positive examples out there, how many did the model correctly predict to be positive? The formulas for precision and recall differ only in the second term of the denominator: False Positives for precision, False Negatives for recall. |
| 3. F1 Score | The harmonic mean of precision and recall, so it serves as a single metric that considers both. It is especially useful when working with imbalanced datasets; if either precision or recall is 0, the F1 score will also be 0. |
| 4. Macro Average | The unweighted mean of the per-class scores; every class is treated equally regardless of its support. |
| 5. Weighted Average | The mean of the per-class scores weighted by each class's support (number of samples). |
| 6. Log Loss | The most important classification metric based on predicted probabilities, and a good metric for comparing models. A lower log-loss value means better predictions. |
| 7. Mean Absolute Error (MAE) | The average absolute difference between the model's predictions and the actual values. The lower the MAE, the more closely the model predicts the actual observations. |
| 8. Root Mean Squared Error (RMSE) | The square root of the average squared error. The lower the RMSE, the more closely the model predicts the actual observations. |
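To make the difference between macro and weighted averaging concrete, here is a small worked example with hypothetical per-class F1 scores and supports (the numbers are illustrative, not taken from the reports above):

```python
# Hypothetical per-class F1 scores and their supports (sample counts)
f1_scores = {'class_0': 1.00, 'class_1': 0.80, 'class_2': 0.50}
supports  = {'class_0': 4,    'class_1': 6,    'class_2': 10}

# Macro average: every class counts equally, regardless of support
macro_f1 = sum(f1_scores.values()) / len(f1_scores)

# Weighted average: each class is weighted by its share of the samples
total = sum(supports.values())
weighted_f1 = sum(f1_scores[c] * supports[c] / total for c in f1_scores)
```

Here the macro F1 (about 0.77) exceeds the weighted F1 (0.69) because the small classes happen to score well; with many easy minority classes the two averages can diverge substantially, which is why the choice matters on imbalanced data.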
The table above shows, for each vectorization technique and machine learning model, the metrics for precision, recall, f1-score, MAE, RMSE, accuracy and log loss. We report the macro average: although our dataset is imbalanced, all of our classes are equally important, so the macro average is a good choice because it treats every class equally. Since the dataset is very small, the near-perfect accuracy of several models is a sign of overfitting; to account for this, LOOCV was used to estimate MAE and RMSE.
From all of the above model metrics, the following models gave us good results:
| Type of Vectorizer | Machine Learning Model | precision | recall | f1-score | MAE | RMSE | accuracy | log-loss |
|---|---|---|---|---|---|---|---|---|
| CountVectorizer | XGB Classifier | 1.000 | 1.000 | 1.000 | 0.019 | 0.136 | 1.000 | 0.112 |
| Tf-Idf | XGB Classifier | 0.980 | 0.960 | 0.970 | 0.056 | 0.236 | 0.960 | 0.129 |
| Tf-Idf | Random Forest Classifier | 1.000 | 1.000 | 1.000 | 0.074 | 0.272 | 0.960 | 0.393 |
| CountVectorizer | Logistic Regression | 1.000 | 1.000 | 1.000 | 0.093 | 0.304 | 1.000 | 0.007 |
| Glove | Deep Neural Network- Simple Dense Network | - | - | - | - | - | 0.958 | 0.113 |
So we conclude that the XGB Classifier with Tf-Idf vectorization gives us a low MAE together with good accuracy and a low log loss, and it does not appear to overfit the data.
That said, because our dataset is small we cannot be certain this model is the best; including more data would give a more reliable result.
final_model= xgb.XGBClassifier(max_depth=7, n_estimators=200, colsample_bytree=0.8,
subsample=0.8, nthread=10, learning_rate=0.1)
final_model.fit(X_train_vectors_tfidf , y_train)
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=0.8,
enable_categorical=False, gamma=0, gpu_id=-1,
importance_type=None, interaction_constraints='',
learning_rate=0.1, max_delta_step=0, max_depth=7,
min_child_weight=1, missing=nan, monotone_constraints='()',
n_estimators=200, n_jobs=10, nthread=10, num_parallel_tree=1,
objective='multi:softprob', predictor='auto', random_state=0,
reg_alpha=0, reg_lambda=1, scale_pos_weight=None, subsample=0.8,
tree_method='exact', validate_parameters=1, verbosity=None)
y_predict = final_model.predict(X_test_vectors_tfidf)
y_prob = final_model.predict_proba(X_test_vectors_tfidf)
print(classification_report(y_test, y_predict))
print("logloss: %0.3f" % multiclass_logloss(y_test, y_prob))
# use LOOCV to evaluate the model
cv = LeaveOneOut()
scores = cross_val_score(final_model, X_train_vectors_tfidf, y_train,
                         scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
# mean absolute error across the held-out samples
MAE = mean(absolute(scores))
# note: this equals the true RMSE only when each per-sample absolute error is 0 or 1
RMSE = sqrt(mean(absolute(scores)))
print("Mean Absolute Error: ", MAE)
print("Root Mean Squared Error: ", RMSE)
conf_matrix = confusion_matrix(y_test, y_predict)
ax = sns.heatmap(conf_matrix, annot=True, cmap='Blues')
print('Confusion Matrix:', ax)  # prints the Axes object; the heatmap itself renders inline
precision recall f1-score support
0 1.00 1.00 1.00 4
1 1.00 0.83 0.91 6
2 1.00 1.00 1.00 4
3 0.91 1.00 0.95 10
accuracy 0.96 24
macro avg 0.98 0.96 0.97 24
weighted avg 0.96 0.96 0.96 24
logloss: 0.129
Mean Absolute Error: 0.05555555555555555
Root Mean Squared Error: 0.23570226039551584
Confusion Matrix: AxesSubplot(0.125,0.125;0.62x0.755)
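The `multiclass_logloss` helper printed above is defined earlier in the notebook; a minimal stand-in (an assumption, not the original implementation) computes the mean negative log probability assigned to the true class:

```python
import numpy as np

def multiclass_logloss_sketch(y_true, y_prob, eps=1e-15):
    """Mean negative log probability assigned to the true class.

    y_true: integer class labels, shape (n_samples,)
    y_prob: predicted probabilities, shape (n_samples, n_classes)
    """
    # Clip to avoid log(0) for over-confident wrong predictions
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    rows = np.arange(len(y_true))
    return float(-np.mean(np.log(y_prob[rows, np.asarray(y_true)])))

# Toy check: 3 samples, 2 classes
print(round(multiclass_logloss_sketch([0, 1, 1],
                                      [[0.9, 0.1], [0.2, 0.8], [0.3, 0.7]]), 3))  # → 0.228
```

Lower is better: a perfect model that puts probability 1 on every true class scores 0, while confident mistakes are penalized heavily.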
# refit the chosen model so we can predict on the full dataset
fm = xgb.XGBClassifier(max_depth=7, n_estimators=200, colsample_bytree=0.8,
                       subsample=0.8, nthread=10, learning_rate=0.1)
fm.fit(X_train_vectors_tfidf, y_train)
# convert the cleaned text to numerical features using the fitted tf-idf vectorizer
X_check = NLP_data['clean_text']
X_vector = tfidf_vectorizer.transform(X_check)
# use the best model to predict the 'target' value for the full dataset
y_predict = fm.predict(X_vector)
y_prob = fm.predict_proba(X_vector).max(axis=1)  # probability of the predicted class
NLP_data['predict_prob'] = y_prob
NLP_data['target'] = y_predict
final = NLP_data[['clean_text', 'Label', 'LabelId', 'target']].reset_index(drop=True)
final
| | clean_text | Label | LabelId | target |
|---|---|---|---|---|
| 0 | anubhav kumar singh core competency script she... | Peoplesoft resumes | 0 | 0 |
| 1 | g ananda rayudu http www linkedin com anandgud... | Peoplesoft resumes | 0 | 0 |
| 2 | peoplesoft database administrator gangareddy p... | Peoplesoft resumes | 0 | 0 |
| 3 | classification internal classification interna... | Peoplesoft resumes | 0 | 0 |
| 4 | priyanka ramadoss mountpleasant coonoor nilgir... | Peoplesoft resumes | 0 | 0 |
| ... | ... | ... | ... | ... |
| 73 | pranish sonone career summary experience year ... | React JS | 3 | 3 |
| 74 | ranga gaganam professional summary professiona... | React JS | 3 | 3 |
| 75 | shaik abdul sharuk year experience wipro caree... | React JS | 3 | 3 |
| 76 | name ravali p curriculum vitae specialization ... | Internship | 4 | 4 |
| 77 | susovan bag seek challenging position field sc... | Internship | 4 | 4 |
78 rows × 4 columns
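With the predictions written back into the dataframe, resumes where the predicted `target` disagrees with the true `LabelId` are easy to pull out for inspection. A minimal sketch on a toy stand-in for `final` (the rows and the wrong prediction here are hypothetical):

```python
import pandas as pd

# Toy stand-in for the `final` dataframe built above
final = pd.DataFrame({
    'clean_text': ['resume a', 'resume b', 'resume c'],
    'Label': ['Peoplesoft resumes', 'React JS', 'Internship'],
    'LabelId': [0, 3, 4],
    'target': [0, 3, 0],   # hypothetical predictions: last one is wrong
})

# Rows where the model disagrees with the true label
mismatches = final[final['LabelId'] != final['target']]
print(len(mismatches))  # → 1
```

Reading the `clean_text` of these mismatched rows is often the quickest way to see which resume categories the model confuses.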
from joblib import dump, load
dump(fm, 'textclassification.pkl')
['textclassification.pkl']
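For deployment, the fitted tf-idf vectorizer must be persisted alongside the classifier, since new text has to be transformed into the same feature space before prediction. A sketch of the round trip, using a small LogisticRegression stand-in and hypothetical filenames rather than the actual fitted XGB model and vectorizer:

```python
from joblib import dump, load
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Stand-in pipeline (toy data; LogisticRegression instead of the fitted XGB model)
vec = TfidfVectorizer()
X = vec.fit_transform(['peoplesoft dba resume', 'react js developer resume'])
clf = LogisticRegression().fit(X, [0, 3])

# Persist BOTH artifacts: the model alone cannot featurize new text
dump(vec, 'tfidf_vectorizer.pkl')   # hypothetical filename
dump(clf, 'textclassification.pkl')

# Later, in the serving process:
vec_loaded = load('tfidf_vectorizer.pkl')
model_loaded = load('textclassification.pkl')
print(model_loaded.predict(vec_loaded.transform(['senior react js engineer'])))  # → [3]
```

Forgetting to save the vectorizer is a common deployment pitfall: refitting it on new text at serving time would silently produce a different vocabulary and wrong predictions.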
We have walked through a complete end-to-end machine learning project using the employee resume files. We started by converting the .doc files to .docx, then extracted the text from the .docx and .pdf files and stored the data as a .csv file to build a dataframe. We then applied various exploratory data analysis methods for text data: sentiment analysis, the top words per document via word clouds, the information contained in each resume via named entity recognition, and finally topic model analysis, which reveals the similarity between topics and the relevant words under them. We covered the basics of building a text classification model, comparing Bag-of-Words (with Tf-Idf and CountVectorizer) against word embeddings (with Word2Vec and GloVe). Finally, we trained a variety of classifiers and a dense neural network and evaluated each model with a classification report, log-loss, MAE, RMSE and a confusion matrix heatmap.